OmniOPD：基于推测验证的无需logits在线策略蒸馏

2026-05-31 08:00·33天前

AI 摘要

OmniOPD是一种无需教师token级logits的在线策略蒸馏框架。它通过蒙特卡洛展开在多token块上以连续语义相似度近似教师偏好，并用峰值熵调度器仅在高不确定性推理分叉处施加监督，同时以Dirichlet-Multinomial贝叶斯先验和基模型KL锚点防止策略坍塌。在数学基准上，OmniOPD相比标准OPD提升高达28.64%；与Claude-4.5-Haiku和Gemini-2.5-Flash等黑箱教师配合时，额外相对提升9.54%，令学生模型超越自我探索强化学习。

原文 · 未翻译

On-Policy Distillation (OPD) trains a student model on its own generative trajectories under dense token-level feedback from a stronger teacher, mitigating both the off-policy distribution shift of Supervised Fine-Tuning (SFT) and the sparse credit assignment of Reinforcement Learning (RL). However, standard OPD faces two coupled limitations. First, it requires direct access to the teacher's token-level logits, excluding a broad class of capable proprietary models from serving as teachers. Second, the token-level logit signal itself is brittle, depending on a narrow overlap of plausible next tokens between teacher and student, and prone to amplifying degenerate patterns such as repetition loops. In this paper, we introduce OmniOPD, a novel framework that addresses both limitations through a logit-free, chunk-level supervision signal. OmniOPD replaces deterministic logit matching with Monte Carlo rollouts that approximate the teacher's local preferences through a continuous semantic similarity metric over multi-token chunks, and concentrates this supervision via a peak-entropy scheduler that audits the student only at its high-uncertainty reasoning forks. A Dirichlet-Multinomial Bayesian prior and a base-model KL anchor further bound the variance of discrete sampling and prevent policy collapse across unaudited tokens. Across competitive benchmarks, OmniOPD surpasses the standard OPD approach by up to +28.64% on math, confirming that chunk-level semantic verification extracts a more reliable learning signal than token-level logit matching, whose high information density is offset by significant noise and brittleness. Furthermore, when paired with stronger black-box teachers such as Claude-4.5-Haiku and Gemini-2.5-Flash, OmniOPD achieves an additional +9.54% relative on math over its open-weight teacher counterpart, advancing the student past the performance of self-exploratory RL.

HuggingFace Daily Papers（社区热门论文）

62导出 Markdown

OmniOPD：基于推测验证的无需logits在线策略蒸馏

2026-05-31 08:00·33天前

阅读原文· arxiv.org

AI 摘要

原文 · 保持原样，未翻译