DOPD: Advantage-Aware Dual On-Policy Distillation
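Source: arxiv.org
On-policy distillation (OPD) transfers capability by supervising student-sampled trajectories with dense token-level signals, but introducing privileged information triggers a "privilege illusion": the student mistakes the information-asymmetry gap for a transferable capability gap. The problem is aggravated by the non-uniformity of token-level supervision, where only a few tokens carry key capability signals. DOPD proposes an advantage-aware dual distillation paradigm that dynamically routes token-level supervision between privileged teacher and privileged student policies according to their advantage gap and relative probabilities, which mitigates privilege illusion. Experiments on LLMs and VLMs show that DOPD outperforms standard OPD and other methods.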
On-policy distillation (OPD) offers superior capacity transfer by supervising student-sampled trajectories with dense token-level signals. To furnish high-quality supervision sources and thereby raise the performance frontier of distillation, an intuitive direction is to supply privileged information to either the teacher or the student itself. However, this additional input induces a potential failure mode we dub privilege illusion: conflating the transferable capability gap that students are meant to close with the information-asymmetry gap, which can only be mimicked but never replicated. This issue is further amplified by the inherent non-uniformity of token-level supervision, where only a small subset of tokens carries pivotal capability-bearing signals. To this end, we propose DOPD, an advantage-aware dual distillation paradigm that dynamically routes token-level supervision between privileged teacher and privileged student policies based on their advantage gap and relative probabilities. Each token receives supervision of a different strength, objective, and strategy from either the teacher or the student itself, so that credible capability is transferred while auxiliary signals are still received, which alleviates privilege illusion. Extensive experiments in both large language model (LLM) and vision-language model (VLM) settings demonstrate that DOPD consistently outperforms vanilla OPD and other counterparts. Further results on stability, robustness, continual learning, and out-of-distribution tasks validate its superiority.
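To make the notion of a dense token-level signal concrete, the following is a minimal sketch of one common way to score a student-sampled sequence against a teacher: the per-position reverse KL divergence, KL(student || teacher). This formulation is an assumption made for illustration and is not taken from the abstract. The function names and the toy logits are hypothetical, and the sketch does not implement DOPD's advantage-aware routing, whose details the abstract does not give.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def per_token_reverse_kl(student_logits, teacher_logits):
    """Return KL(student || teacher) at each position of a sequence.

    Both arguments are lists of per-position logit rows evaluated on the
    same student-sampled tokens (hypothetical inputs for illustration).
    """
    kls = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        s_lp = log_softmax(s_row)
        t_lp = log_softmax(t_row)
        kls.append(sum(math.exp(s) * (s - t) for s, t in zip(s_lp, t_lp)))
    return kls


# Toy sequence: 3 positions, vocabulary of 4 (made-up numbers).
# Student and teacher agree at positions 0 and 2 and disagree at position 1.
student = [[2.0, 0.5, 0.1, -1.0], [0.0, 0.0, 0.0, 0.0], [1.0, 3.0, 0.2, 0.1]]
teacher = [[2.0, 0.5, 0.1, -1.0], [3.0, 0.0, 0.0, 0.0], [1.0, 3.0, 0.2, 0.1]]

token_kl = per_token_reverse_kl(student, teacher)
for position, kl in enumerate(token_kl):
    print(f"position {position}: reverse KL = {kl:.4f}")
```

In this toy case only the middle position yields a non-zero signal, which illustrates the non-uniformity the abstract refers to: a uniform per-token loss would spend most of its weight on positions where there is little to learn.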