Rethinking On-Policy Distillation for Large Language Models: Phenomena, Mechanisms, and Optimization Strategies
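Source: arxiv.org
This study systematically analyzes the dynamics of on-policy distillation (OPD) for large language models and finds that its success depends on two key conditions: the student and teacher must have compatible thinking patterns, and the teacher must offer new capabilities that the student has not yet been exposed to. Through weak-to-strong reverse distillation experiments, it confirms that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. Mechanistically, successful OPD shows progressive alignment on high-probability tokens, with a shared token set of just 3% carrying 97%-99% of the probability mass. The study proposes two optimization strategies, off-policy cold start and teacher-aligned prompt selection, and also points out the hidden cost of OPD's dense token-level reward, questioning its scalability to long-horizon distillation.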
On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify two conditions that govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. Probing the token-level mechanism, we show that successful OPD is characterized by progressive alignment on high-probability tokens at student-visited states, where a small shared token set concentrates most of the probability mass (97%-99%). We further propose two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection. Finally, we show that OPD's apparent free lunch of dense token-level reward comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.
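
To make the token-level picture concrete, the following is a minimal sketch of how one might measure, at a single student-visited state, the probability mass that student and teacher place on their shared high-probability tokens. The abstract does not specify how the shared token set or the per-token signal is computed, so everything here is an assumption for illustration: the shared set is taken as the intersection of the two top-k sets, the per-token divergence is a reverse KL (a common choice in OPD setups, not confirmed by the source), and the function names and toy distributions are hypothetical.

```python
import math


def top_k_ids(probs, k):
    """Return the set of indices holding the k largest probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])


def shared_mass(student, teacher, k):
    """Mass each model assigns to the intersection of the two top-k sets.

    Assumption: "shared token set" is modeled as a top-k intersection;
    the paper may define it differently.
    """
    shared = top_k_ids(student, k) & top_k_ids(teacher, k)
    s_mass = sum(student[i] for i in shared)
    t_mass = sum(teacher[i] for i in shared)
    return shared, s_mass, t_mass


def reverse_kl(student, teacher, eps=1e-12):
    """KL(student || teacher) for one next-token distribution.

    Assumption: reverse KL stands in for the dense token-level signal.
    """
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student, teacher)
        if p > 0
    )


# Toy next-token distributions over a 6-token vocabulary (made-up numbers).
student = [0.60, 0.25, 0.10, 0.03, 0.01, 0.01]
teacher = [0.50, 0.35, 0.02, 0.10, 0.02, 0.01]

shared, s_mass, t_mass = shared_mass(student, teacher, k=3)
divergence = reverse_kl(student, teacher)
print(sorted(shared), round(s_mass, 2), round(t_mass, 2), divergence > 0)
```

In this toy case the two top-3 sets overlap on two tokens, and each model puts 0.85 of its mass on that overlap. Tracking such a quantity over training steps, averaged across states sampled from the student, is one plausible way to observe the progressive alignment the abstract describes.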