Language Models Need Sleep
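Source: arxiv.org
Transformers' attention mechanism handles long contexts inefficiently. To address this, the study proposes a sleep-like consolidation mechanism: the model periodically converts recent context into persistent fast weights and clears its key-value cache. During sleep, the model makes N offline recurrent passes over the accumulated context and updates the fast weights in its state-space model (SSM) blocks through a local rule. This shifts the extra computation to the sleep phase and so keeps wake-time inference fast. The method was tested on synthetic tasks, including cellular automata and multi-hop graph retrieval, and on a math reasoning task on which both a regular transformer and SSM-attention hybrid models fail. The results show that increasing the sleep duration N improves performance, with the largest gains on tasks that require deeper reasoning.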
Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To handle this, we study a sleep-like consolidation mechanism in which a model periodically converts recent context into persistent fast weights before clearing its key-value cache. During sleep, the model performs N offline recurrent passes over the accumulated context and updates the fast weights in its state-space model (SSM) blocks through a learned local rule. During inference, this shifts extra computation to sleep while preserving the latency of wake-time prediction. We test our method on controlled synthetic tasks, including cellular automata and multi-hop graph retrieval, as well as a realistic math reasoning task, on which a regular transformer as well as SSM-attention hybrid models fail. We then show that increasing sleep duration N for our models improves performance, with the largest gains on examples that require deeper reasoning.
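The wake/sleep cycle described in the abstract can be pictured with a toy sketch. This is not the paper's implementation: the abstract does not specify the learned local rule, so a fixed delta-rule update stands in for it here, a plain list stands in for the key-value cache, and a small matrix stands in for the fast weights of an SSM block. All names (`ToySleepModel`, `wake_step`, `read`, `sleep`, `lr`) are hypothetical.

```python
# Toy illustration only. Assumption: a fixed delta-rule update replaces the
# paper's learned local rule, which the abstract does not describe.

class ToySleepModel:
    def __init__(self, dim, lr=0.5):
        self.dim = dim
        self.lr = lr
        # Persistent fast weights (dim x dim); these survive each sleep phase.
        self.fast_weights = [[0.0] * dim for _ in range(dim)]
        # Stand-in for the key-value cache filled during wake time.
        self.kv_cache = []

    def wake_step(self, key, value):
        """Record one (key, value) pair in the cache while awake."""
        self.kv_cache.append((list(key), list(value)))

    def read(self, key):
        """Query the fast weights: returns the matrix-vector product W @ key."""
        return [sum(w * k for w, k in zip(row, key)) for row in self.fast_weights]

    def sleep(self, n_passes):
        """Make n_passes offline passes over the cache, then clear the cache."""
        for _ in range(n_passes):
            for key, value in self.kv_cache:
                pred = self.read(key)
                for i in range(self.dim):
                    err = value[i] - pred[i]
                    for j in range(self.dim):
                        # Local update: uses only this pair's error and key.
                        self.fast_weights[i][j] += self.lr * err * key[j]
        self.kv_cache.clear()


def recall_error(n_passes):
    """Absolute recall error for one stored pair after n_passes of sleep."""
    model = ToySleepModel(dim=2)
    key, value = [1.0, 0.0], [2.0, -1.0]
    model.wake_step(key, value)
    model.sleep(n_passes)
    recalled = model.read(key)
    return sum(abs(v - r) for v, r in zip(value, recalled))


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, recall_error(n))
```

In this toy setting, more sleep passes bring the fast weights closer to the stored association, and the cache is empty once sleep ends, so wake-time reads touch only the fast weights. That mirrors the structure of the mechanism but says nothing about the size of the gains reported in the paper.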