# The State-Prediction Separation Hypothesis: A Two-Stream Transformer Variant Improves Language Modeling Efficiency

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-07-01 08:00
- AIHOT score: 53
- AIHOT link: https://aihot.virxact.com/items/cmr307xet0cvgsl8zbxokxg56
- Original link: https://arxiv.org/abs/2607.01218

## AI Summary

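Transformers use a single forward computation stream both to predict the next token and to store state for future predictions. To decouple these two roles, the authors propose the state-prediction separation hypothesis and design a Transformer variant with two computation streams. Pretraining experiments at multiple scales show that the method is consistently more data- and compute-efficient than the standard Transformer, reaching lower validation loss and improving average downstream task performance by 2 to 3 percentage points. Further empirical analysis rules out potential confounders and reveals a fundamental difference in the gradients under the new design.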

## Main Text

Transformers use the same forward computation stream both to predict the next token and to store useful state for future token predictions. We formulate the state-prediction separation hypothesis: disentangling the two roles yields better language modeling performance. We design a Transformer variant that uses two computation streams to separate the two functions, and we conduct pretraining experiments across various scales. Our experiments show that state-prediction separation consistently offers better data and compute efficiency, improving validation loss and outperforming standard Transformers by 2 to 3 percentage points on average on downstream tasks. We also conduct extensive empirical analysis that rules out potential confounders and demonstrates the fundamental difference in the gradients that our design entails.
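
The abstract does not describe the architecture in detail, so the following is only a toy sketch of what separating the two roles could look like, not the authors' design. It assumes one attention layer, a single head, and no MLP or normalization. A hypothetical "state" stream is the only thing future positions read from, and a hypothetical "prediction" stream queries that state and is the only thing fed to the output head. All names (`two_stream_forward`, `init_params`, the weight keys) are made up for illustration.

```python
import numpy as np

def causal_attention(queries, keys, values):
    """Single-head causal attention: position t attends only to positions <= t."""
    seq_len, dim = queries.shape
    scores = queries @ keys.T / np.sqrt(dim)
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values

def init_params(dim, vocab_size, seed=0):
    """Random weights for the toy model (hypothetical parameter layout)."""
    rng = np.random.default_rng(seed)
    # s* = state-stream projections, p* = prediction-stream projections
    params = {name: rng.normal(0.0, dim ** -0.5, (dim, dim))
              for name in ("sq", "sk", "sv", "pq", "pk", "pv")}
    params["embed"] = rng.normal(0.0, 1.0, (vocab_size, dim))
    params["unembed"] = rng.normal(0.0, dim ** -0.5, (dim, vocab_size))
    return params

def two_stream_forward(token_ids, params):
    """Return (state, logits) for a 1-D array of token ids."""
    x = params["embed"][token_ids]
    # State stream: ordinary causal self-attention. Its output is what later
    # positions read, and it never goes to the output head.
    state = x + causal_attention(x @ params["sq"], x @ params["sk"], x @ params["sv"])
    # Prediction stream: queries come from its own input, keys and values come
    # from the state stream. Only this stream produces next-token logits.
    pred = x + causal_attention(x @ params["pq"], state @ params["pk"], state @ params["pv"])
    logits = pred @ params["unembed"]
    return state, logits

params = init_params(dim=8, vocab_size=11)
state, logits = two_stream_forward(np.array([1, 2, 3, 4, 5]), params)
# state has shape (5, 8); logits has shape (5, 11)
```

In this toy setup the state stream does not depend on the prediction-stream weights, and the next-token loss reaches the state only through the keys and values the prediction stream reads. That is one simple way the gradients could differ from a single-stream model, though the paper's actual mechanism may be different.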
