The State-Prediction Separation Hypothesis: A Two-Stream Transformer Variant Improves Language Modeling Efficiency
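Source: arxiv.org
Transformers use the same forward computation stream both to predict the next token and to store state for future predictions. To decouple these two roles, the authors propose the state-prediction separation hypothesis and design a Transformer variant with two computation streams. Pretraining experiments at several scales show that the method is consistently more data- and compute-efficient than the standard Transformer, reaching lower validation loss and improving average downstream-task performance by 2 to 3 percentage points. Further empirical analysis rules out potential confounders and reveals a fundamental difference in the gradients under the new design.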
Transformers use the same forward computation stream both to predict the next token and to store useful state for future token predictions. We formulate the state-prediction separation hypothesis: disentangling the two roles yields better language modeling performance. We design a Transformer variant that uses two computation streams to separate the two functions, and conduct pretraining experiments across various scales. Our experiments show that state-prediction separation consistently offers better data and compute efficiency, improving validation loss and outperforming standard Transformers by 2 to 3 percentage points on average on downstream tasks. We also conduct extensive empirical analysis that rules out potential confounders and demonstrates the fundamental difference in the gradients our design entails.
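
The abstract does not spell out how the two streams are wired, so the following is only an illustrative sketch of one way the separation could look, not the authors' architecture. It assumes a "state" stream that alone produces the keys and values every position attends to, and a "prediction" stream that reads from that state but never writes back into it. All names (`two_stream_layer`, `q_state`, `q_pred`, and so on) are hypothetical, and the layer omits MLPs, normalization, and multiple heads for brevity.

```python
import math
import random

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m); both are lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices (residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def causal_attention(queries, keys, values):
    """Single-head causal attention: position t attends only to positions <= t."""
    d = len(keys[0])
    out = []
    for t, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
                  for j in range(t + 1)]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * values[j][c] for j, w in enumerate(weights))
                    for c in range(len(values[0]))])
    return out

def rand_matrix(rng, rows, cols):
    """Small random matrix; stands in for learned weights in this sketch."""
    return [[rng.gauss(0.0, 0.3) for _ in range(cols)] for _ in range(rows)]

def make_layer(rng, d):
    """Hypothetical parameters for one layer: shared K/V, separate queries."""
    return {name: rand_matrix(rng, d, d) for name in ("q_state", "q_pred", "k", "v")}

def two_stream_layer(state, pred, w):
    """Assumed wiring: keys and values come from the state stream only.

    The state stream updates itself from its own past; the prediction stream
    reads the same keys/values but its output never feeds the state stream.
    """
    k = matmul(state, w["k"])
    v = matmul(state, w["v"])
    new_state = add(state, causal_attention(matmul(state, w["q_state"]), k, v))
    new_pred = add(pred, causal_attention(matmul(pred, w["q_pred"]), k, v))
    return new_state, new_pred

def forward(embeddings, layers):
    """Run both streams from the same token embeddings through all layers."""
    state, pred = embeddings, embeddings
    for w in layers:
        state, pred = two_stream_layer(state, pred, w)
    return state, pred  # next-token logits would be read from `pred` only

if __name__ == "__main__":
    rng = random.Random(0)
    tokens = rand_matrix(rng, 5, 4)  # 5 positions, model width 4
    state, pred = forward(tokens, [make_layer(rng, 4) for _ in range(2)])
    print(len(pred), len(pred[0]))
```

In this sketch the stored state is independent of the prediction-only parameters, so a next-token loss computed on `pred` would reach the state stream only through the shared keys and values. Whether this matches the gradient difference the authors analyze is not something the abstract confirms.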