Sparse Updates under Dense Supervision: The Sparsity and Geometry of On-Policy Distillation
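Source: arxiv.org
On-policy distillation (OPD) combines the student's on-policy trajectories with dense teacher supervision. The analysis finds that its updates are small in magnitude and coordinate-sparse, spread across layers and concentrated in FFN weights. Training only the discovered subnetwork recovers nearly full performance; however, dense supervision preserves heterogeneous gradient scales, and SGD underperforms AdamW. Geometrically, the updates are full-rank but spectrally concentrated, lie mostly away from the principal singular subspaces of the source weights, and fall on coordinates where the source weights are near zero.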
On-policy distillation (OPD) has recently become a prominent post-training recipe because it combines two desirable ingredients: on-policy student trajectories and dense teacher supervision. How this hybrid changes a model's parameters, however, remains unclear. Across several language and vision-language model pairs and use cases, our analysis yields two main findings. On sparsity, OPD-style updates are small and coordinate-sparse. They are distributed across layers and are usually FFN-heavy. This sparse structure is operationally useful: training only the discovered subnetwork recovers nearly the same performance as full OPD. However, the sparsity-inducing SGD optimizer underperforms AdamW in our optimizer ablation, likely because dense teacher supervision preserves heterogeneous coordinate-wise gradient scales, for which AdamW's adaptive scaling remains useful. On geometry, the updates are numerically full-rank but spectrally concentrated; they lie mostly away from the principal singular subspaces of the source weights and fall disproportionately on coordinates where the source weights are close to zero. These findings suggest that dense teacher supervision does not turn OPD into ordinary dense parameter rewriting; instead, OPD retains important geometric signatures of on-policy post-training.
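To make the quantities in the abstract concrete, the following is a minimal NumPy sketch of how such diagnostics could be computed for a single weight matrix, given the source weights and the weights after OPD. It is not the authors' code. The function name `update_diagnostics` and the parameters `tol`, `k`, `energy`, and `quantile` are hypothetical, and their default values are illustrative assumptions rather than thresholds taken from the paper.

```python
import numpy as np


def update_diagnostics(w_src, w_new, tol=1e-5, k=8, energy=0.9, quantile=0.1):
    """Illustrative sparsity and geometry diagnostics for one weight matrix.

    All thresholds (tol, k, energy, quantile) are assumptions made for this
    sketch; the paper's exact definitions may differ.
    Assumes the update w_new - w_src is not identically zero.
    """
    delta = w_new - w_src
    delta_sq_norm = float(np.sum(delta ** 2))

    # Coordinate sparsity: fraction of entries that are (almost) unchanged.
    sparsity = float(np.mean(np.abs(delta) <= tol))

    # Numerical rank of the update.
    rank = int(np.linalg.matrix_rank(delta))

    # Spectral concentration: smallest number of singular values whose
    # squared sum reaches the requested share of the squared Frobenius norm.
    s = np.linalg.svd(delta, compute_uv=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    k_energy = min(int(np.searchsorted(cumulative, energy)) + 1, len(s))

    # Overlap with the source's principal subspaces: project the update onto
    # the top-k left and right singular vectors of the source weights.
    u, _, vt = np.linalg.svd(w_src, full_matrices=False)
    uk, vk = u[:, :k], vt[:k, :].T
    core = uk.T @ delta @ vk
    principal_overlap = float(np.sum(core ** 2)) / delta_sq_norm

    # Share of update energy on coordinates where the source weight is small
    # (at or below the given quantile of absolute source values).
    threshold = np.quantile(np.abs(w_src), quantile)
    small = np.abs(w_src) <= threshold
    small_coord_mass = float(np.sum(delta[small] ** 2)) / delta_sq_norm

    return {
        "sparsity": sparsity,
        "rank": rank,
        "k_energy": k_energy,
        "principal_overlap": principal_overlap,
        "small_coord_mass": small_coord_mass,
    }
```

Under these assumed definitions, the pattern described in the abstract would correspond to a high `sparsity`, a `rank` close to the matrix's smaller dimension, a `k_energy` much smaller than that rank, a low `principal_overlap`, and a `small_coord_mass` well above `quantile`. In practice the function would be applied per layer, for example to each FFN and attention weight matrix, and the results compared across layers.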