Trust-Region Behavior Blending for On-Policy Distillation

2026-05-29 08:00 · 35 days ago

AI Summary

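In on-policy distillation, the student's early policy is of low quality, which makes teacher supervision less effective. To address this, the authors propose Trust-Region Behavior Blending (TRB). Early in training, TRB samples prefixes not from the student policy but from the policy closest to the teacher within a KL-divergence trust region centered on the student, while leaving the distillation loss unchanged. The KL budget is annealed to zero, so training transitions smoothly back to pure student rollouts. In two math-reasoning distillation settings, TRB achieves the best average performance.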

Original abstract

On-policy distillation (OPD) trains a student on prefixes sampled from its own policy while matching a stronger teacher. This addresses the prefix mismatch of offline distillation, but early student rollouts can still be poor, placing teacher supervision on weak or low-quality prefixes. We propose Trust-Region behavior Blending (TRB), a warmup method that replaces the early rollout policy with the closest-to-teacher behavior policy inside a student-centered KL trust region, while keeping the per-prefix reverse-KL OPD loss unchanged. The KL budget is annealed to zero, so training returns to pure student rollouts after warmup. Across two math-reasoning distillation settings, TRB attains the strongest average among the compared methods.
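The blending step described above can be sketched for a single next-token distribution. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say which divergence defines "closest to teacher" or how the budget is annealed. The sketch assumes both the trust region and the closeness measure use KL(b || ·), in which case the constrained minimizer is a geometric mixture of student and teacher, and it assumes a linear schedule. All function names (`trust_region_behavior`, `linear_budget`, and so on) are hypothetical, and all probabilities are assumed strictly positive.

```python
import math
from typing import List


def kl(p: List[float], q: List[float]) -> float:
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def geometric_blend(student: List[float], teacher: List[float], lam: float) -> List[float]:
    """Return b proportional to student^(1 - lam) * teacher^lam (log-space, normalized)."""
    logits = [(1.0 - lam) * math.log(s) + lam * math.log(t)
              for s, t in zip(student, teacher)]
    top = max(logits)
    weights = [math.exp(x - top) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


def trust_region_behavior(student: List[float], teacher: List[float],
                          budget: float, iters: int = 50) -> List[float]:
    """Assumed objective: minimize KL(b || teacher) subject to KL(b || student) <= budget.

    KL(blend || student) is non-decreasing in lam, so bisection finds the
    largest feasible mixing weight.
    """
    if budget <= 0.0:
        return list(student)          # budget exhausted: pure student rollouts
    if kl(teacher, student) <= budget:
        return list(teacher)          # teacher itself lies inside the trust region
    lo, hi = 0.0, 1.0                 # lo is always feasible, hi is always infeasible
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(geometric_blend(student, teacher, mid), student) <= budget:
            lo = mid
        else:
            hi = mid
    return geometric_blend(student, teacher, lo)


def linear_budget(step: int, warmup_steps: int, initial_budget: float) -> float:
    """Hypothetical schedule: decay the KL budget linearly to zero over warmup."""
    return initial_budget * max(0.0, 1.0 - step / warmup_steps)


def reverse_kl_loss(student: List[float], teacher: List[float]) -> float:
    """Per-prefix reverse-KL OPD loss, KL(student || teacher); unchanged by blending."""
    return kl(student, teacher)


# Toy example: only the rollout (behavior) distribution changes during warmup.
student = [0.70, 0.20, 0.10]
teacher = [0.10, 0.30, 0.60]
for step in (0, 50, 100):
    eps = linear_budget(step, warmup_steps=100, initial_budget=0.2)
    behavior = trust_region_behavior(student, teacher, eps)
    print(step, round(eps, 3), [round(x, 3) for x in behavior])
```

In this sketch the blended distribution is used only to sample prefixes; the loss on each sampled prefix is still the reverse KL between student and teacher, and once the budget reaches zero the behavior distribution coincides with the student's.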

HuggingFace Daily Papers (community trending papers)

Source: arxiv.org
