Distilling LLM Feedback for Lean Theorem Proving

2026-05-29 08:00 · 35 days ago

AI Summary

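To address the sparse rewards, limited exploration, and mode collapse that GRPO exhibits when post-training reasoning models, the authors propose Feedback Distillation. The model is trained to match, at the token level, its own distribution conditioned on privileged feedback produced by a language model, which provides token-level supervision and injects external knowledge. On Lean4 theorem proving, Feedback Distillation preserves more trajectory diversity than GRPO, yielding higher policy entropy and better pass@k scaling. The two methods are complementary: initializing GRPO from a Feedback Distillation checkpoint outperforms either method alone, which points to a promising direction for improving post-training for complex reasoning.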
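The summary compares methods by pass@k scaling. As a point of reference, here is a minimal sketch of one common way to estimate pass@k from n sampled proofs of which c verify: the combinatorial estimator 1 - C(n-c, k) / C(n, k). The summary does not say which estimator the authors use, so treat this as an assumption; the function name `pass_at_k` is illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    a random size-k subset of the n samples contains no correct sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy numbers (not from the paper): 4 sampled proofs, 2 verified.
estimate = pass_at_k(n=4, c=2, k=2)
```

A policy that keeps its samples diverse tends to gain more from larger k under this metric, which is the sense in which the summary speaks of better pass@k scaling.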

Original text · untranslated

Post-training for reasoning models typically combines supervised fine-tuning with reinforcement learning from verifiable rewards, most commonly with GRPO. However, this algorithm suffers from sparse rewards, limited exploration, and mode collapse. Building upon recent works on self-distillation, we propose Feedback Distillation, a training method where the model is trained to match, at the token level, its own distribution conditioned on privileged feedback produced by a language model. Feedback Distillation offers token-level supervision and can inject external knowledge. Evaluating our method for Lean4 theorem-proving, we find that Feedback Distillation maintains greater diversity in generated trajectories than GRPO, yielding higher policy entropy and better pass@k scaling. The two methods are complementary: initializing GRPO from a Feedback Distillation checkpoint outperforms either method alone. All in all, our results suggest a promising avenue to improve post-training for complex reasoning.
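The token-level matching described in the abstract can be illustrated with a small sketch. It assumes the objective is a per-token KL divergence from the feedback-conditioned distribution (the "teacher": the same model, given privileged feedback in its context) to the unconditioned distribution (the "student"). The abstract does not state the divergence or its direction, so the choice of KL(teacher || student), the function names, and the toy logits are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at a single token position."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def feedback_distillation_loss(teacher_seq, student_seq):
    """Mean per-token KL over one trajectory (assumed form of the loss).

    teacher_seq: per-position logits from the model conditioned on feedback.
    student_seq: per-position logits from the same model without feedback.
    """
    if len(teacher_seq) != len(student_seq):
        raise ValueError("sequences must have the same length")
    kls = [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(kls) / len(kls)

# Toy trajectory with 2 positions and a 3-token vocabulary.
student = [[0.0, 0.0, 0.0], [1.0, 0.0, -1.0]]
teacher = [[2.0, 0.0, 0.0], [1.0, 0.0, -1.0]]  # feedback shifts position 0 only
loss = feedback_distillation_loss(teacher, student)
```

Every position contributes a term to the loss, in contrast to a single scalar reward per trajectory. In an actual training setup the teacher logits would be treated as constants and gradients would flow only through the student; that detail is also an assumption here.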

HuggingFace Daily Papers (community trending papers)


Read the original · arxiv.org
