The State-Prediction Separation Hypothesis: A Two-Stream Transformer Variant Improves Language Modeling Efficiency

2026-07-01 08:00 · 1 day ago

AI Summary

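Transformers use the same forward computation stream both to predict the next token and to store state for future predictions. To decouple these two roles, the authors propose the state-prediction separation hypothesis and design a Transformer variant with two computation streams. Pretraining experiments at several scales show that the approach is consistently more data- and compute-efficient than the standard Transformer, with lower validation loss and an average gain of 2-3 percentage points on downstream tasks. Further empirical analysis rules out potential confounders and shows that the new design produces fundamentally different gradients.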

Original abstract

Transformers use the same forward computation stream to both predict the next token and store useful state for future token predictions. We formulate the state-prediction separation hypothesis: disentangling the two roles yields better language modeling performance. We design a Transformer variant that uses two computation streams to separate the two functions, and conduct pretraining experiments across various scales. Our experiments show that state-prediction separation consistently offers better data and compute efficiencies, improving validation loss and outperforming standard Transformers by 2-3 percentage points on average on downstream tasks. We also conduct extensive empirical analysis that rules out potential confounders and demonstrates the fundamental difference in the gradients our design entails.

HuggingFace Daily Papers (community trending papers)

Read the original: arxiv.org
