Rohan Paul@rohanpaul_ai

2026-05-21 11:50 · 43 days ago

AI Summary

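atomic.chat's MTP (Multi-Token Prediction) technique verifies several draft tokens at once, which reduces how often the GPU has to reread the model weights and significantly speeds up local LLM inference. In testing, a 27B dense model went from 51 tokens/s to 117 tokens/s, an improvement of about 137%; a 35B MoE model on 2x RTX 5090 gained about 25%. The technique achieved a draft acceptance rate of about 80% with no accuracy loss and only about 1GB of extra VRAM. Dense models benefit more because they have to read all of their parameters. The project is open source.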

More good news for local LLMs from atomic.chat, which runs 100% offline on your computer.

They just showed MTP (Multi-Token Prediction) pushing a local dense Qwen 27B model from 51 to 117 tokens/s.

And an MoE 35B-A3B model rose from 218 to 267 tokens/s on 2x RTX 5090.

Instead of generating and checking one token at a time, MTP drafts multiple future tokens and verifies them together, so the GPU does less repeated work for every word it prints.

And this makes local LLMs much faster when the draft tokens are accepted often enough.
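The draft-and-verify loop described above can be sketched with toy stand-ins. This is a minimal illustration, not Atomic Chat's implementation: `target_next`, `draft_next`, and the draft length `k` are hypothetical, and a real engine would score all drafted positions in one batched forward pass rather than a Python loop.

```python
from typing import List, Tuple

def target_next(ctx: List[int]) -> int:
    # Hypothetical stand-in for the full model's greedy next token.
    return (sum(ctx) * 31 + len(ctx)) % 50

def draft_next(ctx: List[int]) -> int:
    # Hypothetical cheap drafter: matches the target except when the
    # context length is a multiple of 5 (a toy way to model misses).
    t = target_next(ctx)
    return t if len(ctx) % 5 else (t + 1) % 50

def generate_plain(prompt: List[int], n_new: int) -> Tuple[List[int], int]:
    # Baseline: one full-model pass per generated token.
    out, passes = list(prompt), 0
    while len(out) - len(prompt) < n_new:
        out.append(target_next(out))
        passes += 1
    return out, passes

def generate_with_drafts(prompt: List[int], n_new: int, k: int = 3) -> Tuple[List[int], int]:
    out, passes = list(prompt), 0
    while len(out) - len(prompt) < n_new:
        # 1) Draft k tokens cheaply.
        drafts: List[int] = []
        for _ in range(k):
            drafts.append(draft_next(out + drafts))
        # 2) One verification pass (simulated here position by position).
        passes += 1
        accepted: List[int] = []
        for i in range(k + 1):
            want = target_next(out + accepted)  # the full model's own choice
            accepted.append(want)
            # Stop after the bonus token, or at the first drafted miss
            # (the appended token is then the full model's correction).
            if i == k or drafts[i] != want:
                break
        out.extend(accepted)
    return out[: len(prompt) + n_new], passes

plain, plain_passes = generate_plain([1, 2, 3], 40)
spec, spec_passes = generate_with_drafts([1, 2, 3], 40, k=3)
print(plain == spec, plain_passes, spec_passes)
```

Every emitted token is the full model's own choice, so the output matches plain greedy decoding; only the number of full-model passes drops.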

For many local LLM runs, the limit is not pure compute, but memory bandwidth: how fast the GPU can keep feeding weights into computation.

A local GPU generating text often spends most of its time pulling model weights from VRAM again and again for each token. If MTP lets the model check several drafted tokens in one forward pass, the same giant weight matrices have to be reread less often.
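A back-of-envelope sketch of that bandwidth ceiling: if every token requires streaming the active weights from VRAM once, tokens/s is capped at roughly bandwidth divided by weight bytes. The bandwidth figure and the 4-bit quantization below are illustrative assumptions, not the tested setup, and the estimate ignores KV cache traffic and other overhead.

```python
def max_tokens_per_s(bandwidth_gb_s: float, active_params_b: float, bytes_per_param: float) -> float:
    # Billions of parameters times bytes per parameter gives GB read per pass.
    weight_gb = active_params_b * bytes_per_param
    return bandwidth_gb_s / weight_gb

# Assumed inputs: 1000 GB/s of bandwidth, 0.5 bytes/param (4-bit weights).
dense_27b = max_tokens_per_s(1000, 27, 0.5)  # dense: all 27B params read per token
moe_a3b = max_tokens_per_s(1000, 3, 0.5)     # MoE "A3B": about 3B active params per token
print(round(dense_27b, 1), round(moe_a3b, 1))
```

Under these assumptions the dense model's ceiling is far lower than the MoE's, which is why sharing one weight read across several verified tokens helps the dense model most.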

The most interesting claim in their test is ~80% draft acceptance with zero accuracy loss and only ~1GB extra VRAM. Acceptance rate matters because speculative decoding only pays off when most drafted tokens survive verification.
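To see why acceptance rate matters, here is a rough estimate of tokens produced per verification pass. It assumes each drafted token is accepted independently with probability `alpha` and that `k` tokens are drafted per pass. Both are simplifying assumptions: the post does not state the draft length, and "80% acceptance" may be measured differently.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    # Geometric sum 1 + alpha + alpha^2 + ... + alpha^k: the pass always
    # yields one token from the full model, plus each draft that survives.
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.5, 0.8, 0.9):
    print(alpha, round(expected_tokens_per_pass(alpha, k=3), 3))
# With alpha = 0.8 and k = 3 this gives about 2.95 tokens per pass.
```

This counts tokens per full-model pass only; the real speedup is lower once the cost of drafting and of the wider verification pass is included.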

So this is a strong result for local AI: generation gets faster without changing the model's answers. The dense model is the real winner because memory bandwidth was its main bottleneck.

Their GitHub repo is fully open source.

atomic.chat
MTP speedup Qwen by 2.5x in Atomic Chat
Dense vs MoE models on 2x RTX 5090
Qwen3.6 27B: 51 → 117 tps +137%
Qwen3.6 35B-A3B: 218 → 267 tps +25%
MTP drafts severa...

Rohan Paul@rohanpaul_ai · X
