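atomic.chat's MTP (Multi-Token Prediction) technique verifies several draft tokens in a single pass, which cuts how often the GPU has to re-read the model weights and markedly speeds up local LLM inference. In tests, a dense 27B model went from 51 to 117 tokens/s, a reported gain of about 137%; a 35B MoE model on 2x RTX 5090 sped up by about 25%. The technique reached a draft acceptance rate of about 80%, with no loss of accuracy and only about 1 GB of extra VRAM. Dense models benefit more because they have to read all of their parameters. The project is open source.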
More good news for local LLMs from atomic.chat, which runs 100% offline on your computer.
They just showed MTP (Multi-Token Prediction) pushing a dense 27B Qwen model from 51 to 117 tokens/s, running locally.
An MoE 35B-A3B model also rose from 218 to 267 tokens/s on 2x RTX 5090.
Instead of generating and checking one token at a time, MTP drafts several future tokens and verifies them together, so the GPU does less repeated work for every word it prints.
That makes local LLMs much faster, as long as the draft tokens are accepted often enough.
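To make the draft-and-verify idea concrete, here is a minimal toy sketch in Python. It is not atomic.chat's implementation: `target_next` and `draft_next` are made-up stand-ins for the full model and the draft head, and the draft is simply rigged to be wrong at every fifth position. The point is only to show that the output matches plain decoding while the number of full-model passes drops.

```python
from typing import List, Tuple

VOCAB = 50  # toy vocabulary size (assumption for this sketch)

def target_next(ctx: List[int]) -> int:
    """Hypothetical stand-in for the full model: deterministic next token."""
    return (sum(ctx) * 31 + len(ctx)) % VOCAB

def draft_next(ctx: List[int]) -> int:
    """Hypothetical stand-in for the draft head: wrong at every 5th position."""
    t = target_next(ctx)
    return t if len(ctx) % 5 != 0 else (t + 1) % VOCAB

def generate_plain(prompt: List[int], n_new: int) -> Tuple[List[int], int]:
    """One full-model pass per generated token."""
    ctx, passes = list(prompt), 0
    for _ in range(n_new):
        ctx.append(target_next(ctx))
        passes += 1
    return ctx, passes

def generate_with_drafts(prompt: List[int], n_new: int, k: int = 3) -> Tuple[List[int], int]:
    """Draft k tokens, then verify them in what counts as one full-model pass."""
    ctx, passes = list(prompt), 0
    goal = len(prompt) + n_new
    while len(ctx) < goal:
        # 1) Cheaply draft k future tokens.
        drafts, tmp = [], list(ctx)
        for _ in range(k):
            d = draft_next(tmp)
            drafts.append(d)
            tmp.append(d)
        # 2) Verify. A real model scores all positions in a single batched
        #    forward pass; here we call target_next per position but count
        #    the whole verification as one pass.
        passes += 1
        for d in drafts:
            t = target_next(ctx)
            ctx.append(t)  # the full model's token is always the one kept
            if t != d or len(ctx) >= goal:
                break      # first mismatch: discard the remaining drafts
        else:
            # Every draft was accepted: the same pass also yields a bonus token.
            ctx.append(target_next(ctx))
    return ctx[:goal], passes

plain_out, plain_passes = generate_plain([1, 2, 3], 40)
mtp_out, mtp_passes = generate_with_drafts([1, 2, 3], 40)
print("same output:", plain_out == mtp_out)
print("full-model passes:", plain_passes, "vs", mtp_passes)
```

Because every kept token comes from the full model, the result is identical to plain decoding; the drafts only decide how many tokens each expensive pass gets to emit.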
For many local LLM runs, the limit is not pure compute, but memory bandwidth: how fast the GPU can keep feeding weights into computation.
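A rough back-of-envelope model shows why bandwidth matters here. The sketch below assumes every full-model pass has to stream all the weights once, so passes per second is capped at bandwidth divided by weight size. The 4-bit weights, the 1 TB/s bandwidth, and the three drafts per pass are illustrative assumptions, not figures from the post, and the model ignores draft-head cost and KV-cache traffic. Treat the numbers as a ceiling, not a prediction.

```python
def expected_tokens_per_pass(accept_rate: float, num_drafts: int) -> float:
    """Expected tokens emitted per verification pass, assuming each draft is
    accepted independently with probability accept_rate (a simplification)."""
    if accept_rate >= 1.0:
        return float(num_drafts + 1)
    # Geometric series: 1 + a + a^2 + ... + a^num_drafts
    return (1.0 - accept_rate ** (num_drafts + 1)) / (1.0 - accept_rate)

def bandwidth_bound_tokens_per_s(weight_bytes: float,
                                 bandwidth_bytes_per_s: float,
                                 tokens_per_pass: float = 1.0) -> float:
    """Upper bound on tokens/s if each pass must read all weights once."""
    return bandwidth_bytes_per_s / weight_bytes * tokens_per_pass

WEIGHT_BYTES = 27e9 * 0.5  # assumption: 27B parameters stored at 4 bits each
BANDWIDTH = 1.0e12         # assumption: 1 TB/s, a round illustrative figure
ACCEPT_RATE = 0.8          # assumption: roughly four in five drafts accepted
NUM_DRAFTS = 3             # assumption: three drafts per pass

baseline = bandwidth_bound_tokens_per_s(WEIGHT_BYTES, BANDWIDTH)
with_mtp = bandwidth_bound_tokens_per_s(
    WEIGHT_BYTES, BANDWIDTH, expected_tokens_per_pass(ACCEPT_RATE, NUM_DRAFTS)
)
print(f"ceiling without drafts: {baseline:.1f} tokens/s")
print(f"ceiling with drafts:    {with_mtp:.1f} tokens/s")
```

The same function also hints at why an MoE model such as the 35B-A3B gains less: it reads only a small active slice of its weights per token, so there is less repeated weight traffic to save in the first place.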