Self-Distilled Policy Gradient

2026-06-02 08:00 · 31 days ago

AI Summary

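The paper proposes the Self-Distilled Policy Gradient (SDPG) framework, which combines group-relative verifier advantages with normalized standard deviation, exact full-vocabulary on-policy self-distillation, and reference-policy KL regularization. In sparse-reward reinforcement learning, the language model conditions on privileged context to supervise its own generations, and a full-vocabulary student-to-teacher reverse KL divergence serves as an auxiliary loss. Experiments show that SDPG improves stability and performance over RLVR and self-distillation baselines. The code is open source.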

Original abstract

On-policy self-distillation, where a language model conditions on privileged context to supervise its own generations, is a promising source of dense supervision for sparse-reward reinforcement learning. It can be instantiated as an auxiliary full-vocabulary student-to-teacher reverse Kullback-Leibler divergence loss. We therefore propose SDPG, a self-distilled policy-gradient framework that combines group-relative verifier advantages with normalized standard deviation, exact full-vocabulary on-policy self-distillation, and reference-policy KL regularization. Empirically, SDPG improves stability and performance over RLVR and self-distillation baselines. The code is available at https://github.com/lauyikfung/SDPG.
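The abstract names two ingredients that are easy to illustrate numerically: group-relative advantages scaled by the group's standard deviation, and a reverse KL divergence computed over the full vocabulary. The sketch below is not taken from the SDPG repository. The function names, the `eps` constant, and the use of NumPy are assumptions made for illustration, and the abstract does not state how the terms are weighted in the final objective.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Center and scale verifier rewards within one group.

    `rewards` holds the scalar rewards of several samples drawn for the
    same prompt. `eps` is an assumed stabilizer for groups with zero spread.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def reverse_kl(student_logits, teacher_logits):
    """Per-token KL(student || teacher), summed over the whole vocabulary.

    In self-distillation the teacher is assumed to be the same model,
    conditioned on privileged context; here both are plain logit arrays
    of shape (..., vocab_size).
    """
    log_p = log_softmax(np.asarray(student_logits, dtype=np.float64))
    log_q = log_softmax(np.asarray(teacher_logits, dtype=np.float64))
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)

if __name__ == "__main__":
    # Four samples for one prompt, two of which pass the verifier.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # Two token positions over a toy vocabulary of size three.
    student = np.array([[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]])
    teacher = np.array([[1.5, 1.0, -1.0], [0.0, 0.0, 0.0]])
    print(reverse_kl(student, teacher))
```

The expectation in the reverse KL is taken under the student distribution, which is why it can be computed exactly from a single forward pass of each model on the student's own samples, without sampling from the teacher.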

HuggingFace Daily Papers (community trending papers)

Read the original: arxiv.org
