# Less Is More: An Early Stopping Rollout Strategy for On-Policy Distillation

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-26 08:00
- AIHOT score: 64
- AIHOT link: https://aihot.virxact.com/items/cmpovigyt09i0slv4l88ggu9v
- Original link: https://arxiv.org/abs/2605.27028

## AI Summary

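The study identifies an "Off-policy Teacher Decay" problem in on-policy distillation: when the student model's earlier trajectory serves as context, the teacher model's ability to produce corrective scores for later tokens decays. To address this, the paper proposes an Early Stopping Rollout strategy that restricts the generated rollout to the first few response tokens. Experiments show that this strategy outperforms full on-policy distillation across model sizes, model families, tasks, and training settings, and that it delivers higher GPU efficiency and training stability, especially in cross-model-family scenarios. The study further identifies "Cascading Alignment" and "Sub-mode Commitment" effects, which help explain why the strategy works.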

## Main Text

On-policy distillation (OPD) has recently emerged as a promising alternative to standard sequence-level imitation: it trains a student by scoring the student's own rollouts with a teacher model. However, we observe an "Off-policy Teacher Decay" problem in this paradigm. For later tokens, the context is the student's earlier trajectory, which is off-policy with respect to the teacher, so the teacher's ability to produce a corrective score decays, and the teacher may fall back to the token-completion behavior it learned during pre-training. We empirically verify this problem and propose Early Stopping Rollout (ESR) to fix it: a simple yet effective distillation strategy that restricts rollout generation to the first few response tokens. We show that ESR surpasses full-rollout OPD across model sizes, model families, tasks, and training regimes, and that it exhibits much higher GPU efficiency and training stability, especially in cross-model-family scenarios. We further investigate the mechanism behind this surprising performance and identify the "Cascading Alignment" and "Sub-mode Commitment" effects of ESR, which may explain why it works effectively and sometimes even exceeds the teacher model's performance. In addition, we show that this position-based token selection strategy cannot be fully explained by KL divergence and entropy signals.
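
The idea of capping the rollout can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not state the scoring objective or the token budget, so the per-token reverse KL loss, the `max_response_tokens` parameter, and the toy `toy_student` / `toy_teacher` functions are all assumptions made for illustration.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

# A "model" here is a hypothetical callable: context tokens -> next-token probabilities.
StepFn = Callable[[List[int]], List[float]]


def reverse_kl(student: Sequence[float], teacher: Sequence[float], eps: float = 1e-12) -> float:
    """KL(student || teacher) at one position (assumed objective, not from the paper)."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student, teacher)
        if p > 0.0
    )


def early_stopping_rollout(
    student_step: StepFn,
    prompt: Sequence[int],
    max_response_tokens: int,
    eos_id: int,
    rng: random.Random,
) -> Tuple[List[int], List[List[float]]]:
    """Sample from the student, stopping at EOS or after max_response_tokens."""
    if max_response_tokens < 1:
        raise ValueError("max_response_tokens must be at least 1")
    response: List[int] = []
    student_dists: List[List[float]] = []
    while len(response) < max_response_tokens:
        dist = student_step(list(prompt) + response)
        token = rng.choices(range(len(dist)), weights=dist, k=1)[0]
        student_dists.append(dist)
        response.append(token)
        if token == eos_id:
            break
    return response, student_dists


def distillation_loss(
    teacher_step: StepFn,
    prompt: Sequence[int],
    response: Sequence[int],
    student_dists: Sequence[Sequence[float]],
) -> float:
    """Average the teacher-vs-student divergence over the truncated rollout only."""
    losses = []
    for t, s_dist in enumerate(student_dists):
        # The teacher sees the same student-generated prefix the student saw.
        t_dist = teacher_step(list(prompt) + list(response[:t]))
        losses.append(reverse_kl(s_dist, t_dist))
    return sum(losses) / len(losses)


# Toy stand-ins over a 3-token vocabulary (token 2 plays the role of EOS).
def toy_student(context: List[int]) -> List[float]:
    return [0.7, 0.2, 0.1]


def toy_teacher(context: List[int]) -> List[float]:
    return [0.5, 0.3, 0.2]


if __name__ == "__main__":
    rng = random.Random(0)
    response, dists = early_stopping_rollout(toy_student, [0], 4, eos_id=2, rng=rng)
    print("rollout length:", len(response))
    print("loss:", distillation_loss(toy_teacher, [0], response, dists))
```

In a full-rollout setup the loop would run until EOS or a much larger length limit. Here the only change is the small cap, which means both student generation and teacher scoring touch fewer positions per sample.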
