# 蒸馏LLM反馈用于Lean定理证明

- 来源：HuggingFace Daily Papers（社区热门论文）
- 发布时间：2026-05-29 08:00
- AIHOT 分数：49
- AIHOT 链接：https://aihot.virxact.com/items/cmq95cqy908rdslldywkrs09j
- 原文链接：https://arxiv.org/abs/2605.30861

## AI 摘要

针对推理模型后训练中GRPO存在的稀疏奖励、有限探索和模式坍缩问题，提出Feedback Distillation方法。该方法让模型在token级别匹配其自身分布（基于语言模型提供的特权反馈），提供token级监督并注入外部知识。在Lean4定理证明任务上，Feedback Distillation相比GRPO保持更高轨迹多样性，获得更高策略熵和更好的pass@k缩放。两种方法互补：从Feedback Distillation检查点初始化GRPO优于单独使用任一方法，为改进复杂推理后训练提供了有前景的方向。

## 正文

Post-training for reasoning models typically combines supervised fine-tuning with reinforcement learning from verifiable rewards, most commonly with GRPO. However, this algorithm suffers from sparse rewards, limited exploration, and mode collapse. Building upon recent works on self-distillation, we propose Feedback Distillation, a training method where the model is trained to match, at the token level, its own distribution conditioned on privileged feedback produced by a language model. Feedback Distillation offers token-level supervision and can inject external knowledge. Evaluating our method for Lean4 theorem-proving, we find that Feedback Distillation maintains greater diversity in generated trajectories than GRPO, yielding higher policy entropy and better pass@k scaling. The two methods are complementary: initializing GRPO from a Feedback Distillation checkpoint outperforms either method alone. All in all, our results suggest a promising avenue to improve post-training for complex reasoning.