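Microsoft's new paper, Next-Latent Prediction (NextLat), proposes a self-supervised learning method that adds a next-hidden-state prediction task on top of standard token prediction, pushing the Transformer to learn a compact internal world model. The method performs better on tasks such as map-style world modeling, mathematical reasoning, graph planning, and story prediction. Self-speculative decoding speeds up generation by up to 3.3x, and the method requires no change to the Transformer architecture and does not slow down normal inference.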
New Microsoft paper argues that transformers generalize better when they learn compact internal states, not just next tokens.
The problem is that normal transformers can look back at every earlier token, so they never have to compress the past into a clean summary. Token prediction alone can reward shortcuts that never add up to a coherent world model.
That can work beautifully on familiar data and still fail when the model has to plan, detour, reason, or carry a hidden structure forward.
NextLat fixes this by adding a training task where the model must predict its next hidden state, not just the next word.
A hidden state is the model's private summary of what it has seen, so predicting the next one pushes the model to learn how situations change over time.
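To make the idea concrete, here is a minimal sketch of what a combined objective could look like: the usual next-token cross-entropy plus a penalty for mispredicting the next hidden state. This is an illustration under assumptions, not the paper's implementation. The linear predictor, the mean-squared-error distance, the `weight` parameter, and all function names are hypothetical choices made for this example; the paper's actual predictor and loss may differ.

```python
import math

def cross_entropy(logits, target):
    # Negative log-probability of the target token, via a stable log-sum-exp.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def mse(a, b):
    # Mean squared error between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def predict_next_latent(W, h):
    # Hypothetical latent predictor: a plain linear map from h_t to a guess for h_{t+1}.
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def combined_loss(hidden, logits, targets, W, weight=1.0):
    # hidden[t]: hidden state at step t; logits[t]: token scores at step t;
    # targets[t]: index of the true next token at step t.
    steps = len(hidden)
    token_loss = sum(cross_entropy(logits[t], targets[t]) for t in range(steps)) / steps
    # Compare each predicted next state with the state the model actually reached.
    latent_loss = sum(
        mse(predict_next_latent(W, hidden[t]), hidden[t + 1]) for t in range(steps - 1)
    ) / (steps - 1)
    return token_loss + weight * latent_loss, token_loss, latent_loss

# Toy sequence whose hidden state alternates between two vectors.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
logits = [[0.0, 0.0]] * 3          # uninformative token scores
targets = [0, 1, 0]
swap = [[0.0, 1.0], [1.0, 0.0]]      # predictor that captures the alternation
identity = [[1.0, 0.0], [0.0, 1.0]]  # predictor that assumes nothing changes

good_total, good_token, good_latent = combined_loss(hidden, logits, targets, swap)
bad_total, bad_token, bad_latent = combined_loss(hidden, logits, targets, identity)
```

In this toy case both predictors see the same token loss, but only the one that models how the state changes gets a zero latent loss. The extra term is what rewards learning the dynamics. In a real training loop the target hidden state would presumably be treated as a fixed target rather than something the gradient can move, though that detail is an assumption here as well.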