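Google speeds up Gemma 4 threefold with multi-token prediction
Source: the-decoder.com. Google has released multi-token prediction modules for its Gemma 4 open model family that can speed up text generation by up to three times. A small auxiliary model predicts several tokens at once, and the main model verifies them in a single pass, which significantly improves inference efficiency.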
Google has released multi-token prediction (MTP) drafters for its open Gemma 4 model family, designed to speed up text generation by up to three times. LLMs normally generate text one token at a time, loading billions of parameters from memory at each step. As a result, the processor's compute cores spend most of their time waiting for data, Google says.
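To make the memory bottleneck concrete, here is a rough back-of-the-envelope sketch. It assumes every generated token requires reading all model weights from memory once, so decode speed is capped at memory bandwidth divided by model size. The function name and all figures (10 billion parameters, 2 bytes per parameter, 1 TB/s of bandwidth) are hypothetical illustrations, not numbers from Google or from Gemma 4.

```python
def max_tokens_per_second(param_count: float,
                          bytes_per_param: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Rough upper bound on decode speed for a memory-bound model.

    Assumes each token needs one full read of the weights and ignores
    caches, batching, and compute time (all simplifications).
    """
    model_bytes = param_count * bytes_per_param
    return bandwidth_bytes_per_s / model_bytes


# Hypothetical figures, chosen only to illustrate the arithmetic:
# 10e9 parameters * 2 bytes = 20 GB of weights, read at 1e12 bytes/s.
rate = max_tokens_per_second(10e9, 2, 1e12)
print(f"at most about {rate:.0f} tokens per second")
```

Under these assumptions the ceiling is set by memory traffic rather than arithmetic, which is why checking several tokens per weight read can raise throughput.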
The company's new MTP technology tackles this bottleneck. While the main model waits for its data, a small auxiliary model uses the idle capacity to propose several tokens at once. The main model then checks all of those proposals in a single pass, and if they are correct, they are all accepted at once. Because the smaller model only fills time that would otherwise go to waste, the same text is produced faster with no loss in quality or accuracy, according to Google.
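The draft-then-verify loop described above can be sketched as generic greedy speculative decoding. This is a minimal toy under stated assumptions, not Google's MTP implementation: `target_model` and `draft_model` are hypothetical stand-ins that map a token list to the next token, the verification is written as a loop where a real system would use one batched forward pass, and the handling of a rejected proposal (keep the main model's token and stop) follows common speculative decoding practice rather than anything stated in the article.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def target_model(context: List[int]) -> int:
    """Hypothetical 'main' model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


def draft_model(context: List[int]) -> int:
    """Hypothetical small drafter: agrees with the target except after a 4."""
    return 0 if context[-1] == 4 else (context[-1] + 1) % 10


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Draft k tokens, then keep the prefix the target agrees with."""
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(context + proposal))

    accepted: List[int] = []
    for token in proposal:
        expected = target(context + accepted)
        if expected != token:
            # First disagreement: keep the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)
    # Every draft token matched; the target contributes one more token.
    accepted.append(target(context + accepted))
    return accepted


def speculative_generate(target: NextToken, draft: NextToken,
                         prompt: List[int], n_tokens: int, k: int = 4) -> List[int]:
    """Generate n_tokens after the prompt using repeated draft/verify steps."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + n_tokens]


def greedy_generate(target: NextToken, prompt: List[int], n_tokens: int) -> List[int]:
    """Baseline: the target model alone, one token per step."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target(out))
    return out


print(speculative_generate(target_model, draft_model, [2], 8))
print(greedy_generate(target_model, [2], 8))
```

In this toy, both calls print the same sequence: every kept token is one the target itself would have chosen, so the drafter affects only how many tokens are settled per verification step. That mirrors the claim that output quality is unchanged.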