SG-OPD: Sign-Gated On-Policy Distillation via Sign-Consistency Gating and Phased Teacher Sampling

2026-06-08 08:00 · 25 days ago

AI Summary

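On-policy distillation (OPD) rests on two implicit assumptions that often fail in practice: that student and teacher trajectories are aligned, and that the teacher's preferences are reliable at every token. To address this, SG-OPD uses a trust signal at two complementary granularities, phased teacher sampling and a sign-consistency gate. During the cold-start phase it mixes in verifier-endorsed teacher trajectories; it extrapolates the distillation update where the teacher agrees with the verifier-correct direction and interpolates it where the teacher disagrees. On competition-level mathematical reasoning benchmarks, SG-OPD improves on standard OPD by an average of 1.98 points per sample and 7.50 points per question.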

Original text · untranslated

On-policy distillation (OPD) trains a student on its own trajectories with dense per-token supervision from a stronger teacher, and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its effectiveness implicitly relies on two assumptions that frequently break in practice: trajectory-level alignment between the student and the teacher, and uniform token-level reliability of the teacher's preferences. We therefore propose Sign-Gated On-Policy Distillation (SG-OPD), which uses a binary verifier as a trust signal for the teacher at two complementary granularities: phased teacher sampling mixes in verifier-endorsed teacher rollouts at cold-start, and a sign-consistency gate extrapolates the distillation update on tokens where the teacher agrees with the verifier-correct direction and interpolates it where it disagrees. Experiments on competition-level mathematical reasoning benchmarks show that SG-OPD consistently outperforms standard OPD, with average gains of 1.98 and 7.50 at the per-sample and per-question levels, respectively.
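The abstract names the two mechanisms but does not give their update rules, so the following is only a minimal sketch of one possible reading, not the authors' implementation. Every name here (`gate_weights`, `mix_rollouts`, `alpha`, `beta`, `warmup_steps`, `teacher_fraction`) is hypothetical, as is the assumption that the per-token teacher signal is a signed scalar such as a teacher-minus-student log-probability.

```python
from typing import List, Sequence


def gate_weights(teacher_signal: Sequence[float], verifier_correct: bool,
                 alpha: float = 0.5, beta: float = 0.5) -> List[float]:
    """Per-token multipliers for the distillation update (hypothetical rule).

    teacher_signal: assumed signed per-token preference of the teacher.
    verifier_correct: binary verifier verdict for the whole rollout.
    """
    # Assumption: a verified-correct rollout should be reinforced (+1),
    # a failed one discouraged (-1).
    target_sign = 1.0 if verifier_correct else -1.0
    weights = []
    for s in teacher_signal:
        # A zero signal counts as disagreement in this sketch.
        agrees = s * target_sign > 0
        # Agreement: scale above 1 (extrapolate).
        # Disagreement: scale below 1 (interpolate).
        weights.append(1.0 + alpha if agrees else 1.0 - beta)
    return weights


def mix_rollouts(student: Sequence[str], teacher: Sequence[str],
                 teacher_verified: Sequence[bool], step: int,
                 warmup_steps: int, teacher_fraction: float = 0.25) -> List[str]:
    """Cold-start batch mixing (hypothetical schedule).

    Before warmup_steps, replace the tail of the student batch with
    teacher rollouts that the verifier endorsed; afterwards, train on
    student rollouts only.
    """
    if step >= warmup_steps:
        return list(student)
    endorsed = [r for r, ok in zip(teacher, teacher_verified) if ok]
    k = min(len(endorsed), int(teacher_fraction * len(student)))
    return list(student[:len(student) - k]) + endorsed[:k]
```

Under these assumptions, `gate_weights([0.4, -0.2], True)` amplifies the first token's update and damps the second, and `mix_rollouts` returns a pure student batch once `step` reaches `warmup_steps`. A hard two-phase switch is the simplest schedule to illustrate; the paper may well use a different one.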

HuggingFace Daily Papers (community trending papers)

Read the original · arxiv.org
