# Not All Disagreements Are Learnable: Token Teachability in On-Policy Distillation

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-26 08:00
- AIHOT score: 50
- AIHOT link: https://aihot.virxact.com/items/cmpulkg6204v6slaginqnfrhu
- Original link: https://arxiv.org/abs/2605.26844

## AI Summary

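On-policy distillation (OPD) trains a student model on its own generated sequences using token-level supervision from a teacher model. Existing selective methods prioritize high-entropy or high-disagreement tokens. This work argues that raw KL disagreement is a coarse indicator because it mixes "learnable disagreement" with "incompatible disagreement". The authors introduce "token teachability" to measure how learnable a teacher signal actually is, and build on it a lightweight method, TA-OPD, which applies the distillation loss only at high-teachability positions. In Qwen2.5 and Qwen3 teacher-student experiments, TA-OPD often surpasses full-token OPD while retaining only 5% of tokens, and it improves over entropy- and divergence-based baselines. The result reframes selective distillation as selecting learnable teacher signals.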

## Main Text

On-policy distillation (OPD) trains a student on its own rollouts with token-level teacher supervision. Recent selective OPD methods exploit the non-uniformity of OPD signals by prioritizing high-entropy or high-disagreement tokens. We revisit this principle and ask: which token-level teacher signals are actually learnable? Using a fixed-context diagnostic that measures same-context teacher-student KL reduction, we show that raw KL disagreement is a coarse proxy for learning value. It conflates learnable disagreement, where the teacher assigns corrective mass to the student's top-K candidates, with incompatible disagreement, where the teacher places mass mostly off the student's current support. We formalize this local compatibility as token teachability and show that it better predicts fixed-context improvement than raw KL alone. Motivated by this finding, we propose Teachability-Aware OPD (TA-OPD), a lightweight token-position selection method that applies OPD loss to high-teachability positions without reward models or verifiers. Across Qwen2.5 and Qwen3 teacher-student settings, TA-OPD often surpasses full-token OPD with only 5% retained tokens and improves over entropy- and divergence-based baselines. Our results reframe selective OPD as selecting learnable teacher signals rather than merely salient tokens.
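The abstract does not state the formula for token teachability, so the following is only a rough sketch of the idea under stated assumptions, not the paper's method. It assumes that compatibility can be approximated by the teacher probability mass that falls on the student's top-K tokens, that a position's score is the product of KL(teacher || student) and that mass, and that the highest-scoring fraction of positions is kept. The function names (`teacher_mass_on_student_topk`, `select_positions`), the product score, the KL direction, and the defaults `k=2` and `keep_ratio=0.05` are all hypothetical choices made for illustration.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def teacher_mass_on_student_topk(teacher_probs, student_probs, k):
    """Teacher probability mass on the student's k most likely tokens.

    Assumption: this stands in for the "local compatibility" the abstract
    describes; the paper's actual definition may differ.
    """
    order = sorted(
        range(len(student_probs)),
        key=lambda i: student_probs[i],
        reverse=True,
    )
    return sum(teacher_probs[i] for i in order[:k])


def select_positions(teacher_logits, student_logits, k=2, keep_ratio=0.05):
    """Return (kept position indices, per-position scores).

    Hypothetical score: disagreement weighted by compatibility, so a large
    KL only counts when the teacher's mass lies on the student's top-k.
    """
    scores = []
    for t_logits, s_logits in zip(teacher_logits, student_logits):
        t = softmax(t_logits)
        s = softmax(s_logits)
        mass = teacher_mass_on_student_topk(t, s, k)
        scores.append(kl_divergence(t, s) * mass)
    n_keep = max(1, math.ceil(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep]), scores


# Toy rollout with three positions over a four-token vocabulary.
student = [
    [1.0, 0.5, -1.0, -2.0],   # position 0: teacher and student agree
    [2.0, 1.9, -5.0, -5.0],   # position 1: teacher reweights the student's top-2
    [5.0, 4.0, -5.0, -5.0],   # position 2: teacher mass is off the student's support
]
teacher = [
    [1.0, 0.5, -1.0, -2.0],
    [0.0, 3.0, -5.0, -5.0],
    [-5.0, -5.0, 5.0, 4.0],
]
kept, scores = select_positions(teacher, student, k=2, keep_ratio=0.3)
```

In this toy case, position 2 has the largest raw KL, yet almost none of the teacher's mass lies on the student's top-2 tokens, so the compatibility-weighted score keeps position 1 instead. In a training loop, the kept indices would typically serve as a mask so that the OPD loss is applied only at those positions.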