# SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-04-12 08:00
- AIHOT link: https://aihot.virxact.com/items/cmnyot4lh007xsl0fmnwr2hl9
- Original link: https://arxiv.org/abs/2604.10688

## AI Summary

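On-Policy Distillation (OPD) for reasoning alignment in large language models applies supervision uniformly and so ignores differences in signal quality. To address this, the researchers propose SCOPE, a signal-calibrated dual-path framework. The method routes on-policy rollouts by correctness: incorrect trajectories receive teacher-perplexity-weighted KL distillation, which prioritizes reliable corrective signals; correct trajectories receive student-perplexity-weighted MLE, which reinforces low-confidence samples at the capability boundary; and group-level normalization adaptively calibrates the weight distribution. Experiments on six reasoning benchmarks show that SCOPE improves on competitive baselines by an average of 11.42% in Avg@32 and 7.30% in Pass@32.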

## Main Text

On-policy reinforcement learning has become the dominant paradigm for reasoning alignment in large language models, yet its sparse, outcome-level rewards make token-level credit assignment notoriously difficult. On-Policy Distillation (OPD) alleviates this by introducing dense, token-level KL supervision from a teacher model, but typically applies this supervision uniformly across all rollouts, ignoring fundamental differences in signal quality. We propose Signal-Calibrated On-Policy Distillation Enhancement (SCOPE), a dual-path adaptive training framework that routes on-policy rollouts by correctness into two complementary supervision paths. For incorrect trajectories, SCOPE performs teacher-perplexity-weighted KL distillation to prioritize instances where the teacher demonstrates genuine corrective capability, while down-weighting unreliable guidance. For correct trajectories, it applies student-perplexity-weighted MLE to concentrate reinforcement on low-confidence samples at the capability boundary rather than over-reinforcing already mastered ones. Both paths employ a group-level normalization to adaptively calibrate weight distributions, accounting for the intrinsic difficulty variance across prompts. Extensive experiments on six reasoning benchmarks show that SCOPE achieves an average relative improvement of 11.42% in Avg@32 and 7.30% in Pass@32 over competitive baselines, demonstrating its consistent effectiveness.
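
The dual-path routing described above can be illustrated with a minimal sketch. The abstract does not give the exact weighting formulas, so everything below is an assumption made for illustration: the teacher path uses inverse teacher perplexity as its raw score (treating a confident teacher as more reliable), the student path uses student perplexity directly (so low-confidence correct samples get more weight), and group-level normalization is modeled as rescaling each path's weights to a mean of 1 within one prompt's rollout group. The names `perplexity`, `group_normalize`, and `scope_weights`, and the rollout dictionary layout, are hypothetical and not taken from the paper.

```python
import math


def perplexity(token_logprobs):
    """Sequence perplexity: exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def group_normalize(scores):
    """Rescale positive scores so they average to 1 within the group.

    Assumption: the paper's group-level normalization may take a different form.
    """
    total = sum(scores)
    n = len(scores)
    return [s * n / total for s in scores]


def scope_weights(rollouts):
    """Route one prompt's rollouts by correctness and assign per-rollout weights.

    Each rollout is a dict with keys (hypothetical layout):
      'correct'          - bool, outcome-level correctness of the rollout
      'teacher_logprobs' - teacher log-probs of the rollout's tokens
      'student_logprobs' - student log-probs of the rollout's tokens
    Returns (paths, weights), where paths[i] is 'kl' or 'mle'.
    """
    wrong = [i for i, r in enumerate(rollouts) if not r["correct"]]
    right = [i for i, r in enumerate(rollouts) if r["correct"]]
    paths = [None] * len(rollouts)
    weights = [0.0] * len(rollouts)

    if wrong:
        # Assumed direction: lower teacher perplexity -> larger KL weight.
        raw = [1.0 / perplexity(rollouts[i]["teacher_logprobs"]) for i in wrong]
        for i, w in zip(wrong, group_normalize(raw)):
            paths[i], weights[i] = "kl", w

    if right:
        # Assumed direction: higher student perplexity -> larger MLE weight.
        raw = [perplexity(rollouts[i]["student_logprobs"]) for i in right]
        for i, w in zip(right, group_normalize(raw)):
            paths[i], weights[i] = "mle", w

    return paths, weights
```

In a training loop, each weight would multiply that rollout's loss term: a token-level KL term against the teacher for `'kl'` rollouts and a negative log-likelihood term for `'mle'` rollouts. Normalizing separately within each path keeps the two losses on comparable scales regardless of how many rollouts in the group were correct, which is one plausible reading of calibrating for difficulty variance across prompts.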
