Efficient Reinforcement Learning Training for LLMs with Experience Replay

2026-04-09 08:00 · 85 days ago

AI Summary

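Challenging the conventional view that LLM post-training must use fresh on-policy data, this study systematically examines the use of experience replay. By formalizing how replay buffer design trades off staleness-induced variance, sample diversity, and the compute cost of generation, it finds that strict on-policy sampling is suboptimal when generation is expensive. Empirically, a well-designed replay buffer preserves policy entropy while substantially reducing inference compute, without harming and in some cases even improving final model performance.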

Original abstract · untranslated

While Experience Replay - the practice of storing rollouts and reusing them multiple times during training - is a foundational technique in general RL, it remains largely unexplored in LLM post-training due to the prevailing belief that fresh, on-policy data is essential for high performance. In this work, we challenge this assumption. We present a systematic study of replay buffers for LLM post-training, formalizing the optimal design as a trade-off between staleness-induced variance, sample diversity and the high computational cost of generation. We show that strict on-policy sampling is suboptimal when generation is expensive. Empirically, we show that a well-designed replay buffer can drastically reduce inference compute without degrading - and in some cases even improving - final model performance, while preserving policy entropy.
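To make the idea of storing rollouts and reusing them concrete, here is a minimal sketch of a replay buffer that caps both how stale a rollout may be and how many times it may be reused. It is not the paper's design: the names `Rollout`, `ReplayBuffer`, `max_staleness`, and `max_uses` are hypothetical, and FIFO eviction with uniform sampling is just one simple choice among many.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Rollout:
    prompt: str
    completion: str
    reward: float
    policy_step: int  # training step of the policy that generated this rollout
    uses: int = 0     # how many times it has been sampled for training


class ReplayBuffer:
    """FIFO buffer bounding staleness and reuse (illustrative assumption only)."""

    def __init__(self, capacity, max_staleness, max_uses, seed=0):
        self.items = deque(maxlen=capacity)  # oldest rollouts are evicted first
        self.max_staleness = max_staleness   # max allowed policy-step gap
        self.max_uses = max_uses             # max times a rollout is reused
        self.rng = random.Random(seed)

    def add(self, rollout):
        self.items.append(rollout)

    def sample(self, batch_size, current_step):
        # Discard rollouts that are too stale or have been reused too often.
        self.items = deque(
            (r for r in self.items
             if current_step - r.policy_step <= self.max_staleness
             and r.uses < self.max_uses),
            maxlen=self.items.maxlen,
        )
        # Uniform sampling without replacement from what remains.
        k = min(batch_size, len(self.items))
        batch = self.rng.sample(list(self.items), k)
        for r in batch:
            r.uses += 1
        return batch
```

In this sketch, a larger `max_uses` or `max_staleness` means fewer fresh generations per training step (less inference compute) at the price of more off-policy data, which is the trade-off the abstract describes. Setting `max_staleness=0` and `max_uses=1` corresponds to strict on-policy sampling.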

HuggingFace Daily Papers (community trending papers)


Read the original paper · arxiv.org
