OPRD: On-Policy Representation Distillation

2026-06-04 08:00 · 29 days ago

AI Summary

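Conventional on-policy distillation (OPD) matches next-token probabilities only in output space. It is limited by sampling variance over large vocabularies (e.g., Qwen's ~150k tokens) and ignores the teacher's intermediate hidden states. OPRD lifts distillation into hidden-state space: it aligns student and teacher representations at selected layers on the same rollouts, bypassing the LM head. In theory, this eliminates sampling variance and supplies per-layer structural information. On AIME 2024/2025 and AIMO, OPRD narrows the student-teacher gap, while OPD baselines stay below the teacher. Compared with top-k OPD, training is 1.44x faster and uses 54% less memory. The code is open source.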

Original text · untranslated

On-policy distillation (OPD) supervises the student only in output space by matching next-token probabilities. This output-only paradigm has two limits: (1) sampling variance from Monte Carlo KL estimates over large vocabularies (e.g., Qwen's ~150k tokens) persists throughout training, and (2) it treats the teacher as a black-box, discarding all intermediate hidden states after the LM head. We propose On-Policy Representation Distillation (OPRD), which lifts distillation into hidden-state space by aligning student and teacher representations across selected layers on the same rollouts, bypassing the LM head entirely. Theoretically, OPRD eliminates sampling variance and provides richer per-layer structural information. Empirically, OPRD closes the student-teacher gap on AIME 2024/2025 and AIMO, while output-space OPD baselines plateau below the teacher. OPRD also trains 1.44x faster and uses 54% less memory than top-k OPD. Code: https://github.com/ShenzhiYang2000/OPRD.
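The contrast drawn in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not state the alignment objective, so the mean-squared-error loss below, the same-width hidden vectors, the toy vocabulary size, and all function names (`exact_kl`, `sampled_kl`, `representation_loss`) are assumptions made for illustration only. The sketch compares a one-sample Monte Carlo estimate of KL(student || teacher), which varies from draw to draw, with a loss computed directly on hidden vectors, which involves no sampling over the vocabulary.

```python
import math
import random
import statistics


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def exact_kl(student_logits, teacher_logits):
    """Exact KL(student || teacher), summed over the whole vocabulary."""
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def sampled_kl(student_logits, teacher_logits, rng):
    """One-sample Monte Carlo estimate: log p(x) - log q(x), x drawn from the student."""
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    weights = [math.exp(lp) for lp in log_p]
    x = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return log_p[x] - log_q[x]


def representation_loss(student_hidden, teacher_hidden, layer_ids):
    """Mean squared error between hidden vectors, averaged over selected layers.

    ASSUMPTION: MSE and equal hidden widths are illustrative choices,
    not the objective described in the paper.
    """
    per_layer = []
    for layer in layer_ids:
        s, t = student_hidden[layer], teacher_hidden[layer]
        per_layer.append(sum((a - b) ** 2 for a, b in zip(s, t)) / len(s))
    return sum(per_layer) / len(per_layer)


if __name__ == "__main__":
    rng = random.Random(0)
    vocab = 1000  # toy size; the abstract cites ~150k tokens for Qwen
    student_logits = [rng.gauss(0.0, 1.0) for _ in range(vocab)]
    teacher_logits = [rng.gauss(0.0, 1.0) for _ in range(vocab)]

    # Output space: repeated one-sample estimates scatter around the exact KL.
    estimates = [sampled_kl(student_logits, teacher_logits, rng) for _ in range(500)]
    print("exact KL     :", exact_kl(student_logits, teacher_logits))
    print("MC mean      :", statistics.mean(estimates))
    print("MC std. dev. :", statistics.pstdev(estimates))

    # Hidden-state space: the loss is a deterministic function of the vectors.
    student_hidden = {4: [0.1, 0.2, 0.3], 8: [0.0, -0.1, 0.4]}
    teacher_hidden = {4: [0.1, 0.0, 0.3], 8: [0.2, -0.1, 0.1]}
    print("repr. loss   :", representation_loss(student_hidden, teacher_hidden, [4, 8]))
```

In a real setup the student and teacher usually differ in hidden width, so some mapping between the two spaces would be needed; how OPRD handles this is not described in the abstract, and the repository linked above is the place to check.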

HuggingFace Daily Papers (trending community papers)

Read the original · arxiv.org