# On-Policy Distillation Becomes a Core Post-Training Optimization Technique

- Source: Nathan Lambert (@natolambert)
- Published: 2026-05-06 23:13
- AIHOT score: 56
- AIHOT link: https://aihot.virxact.com/items/cmou7wvh4011tslndqbr6h6a4
- Original link: https://x.com/natolambert/status/2052044068792967573

## AI Summary

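The author added a historical review to their book on how on-policy distillation became a core post-training optimization technique. The math is relatively simple, and its development benefited from progress in distributed training systems. The key turning point was adopting distillation objectives within reinforcement learning setups, which inspired a rich set of reward shaping ideas. The spread of on-policy distillation also stems from the large-scale engineering investment in reinforcement learning algorithms in recent years. The technique evolved from learning from teacher demonstrations to learning from student rollouts, which looks obvious in hindsight but took a great deal of work. Related research such as MiniLLM was the first to propose a policy-gradient-like, on-policy rollout approach to distillation.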

## Full Text

Added a 1,500-word mini history to my book on the path to on-policy distillation becoming a core post-training optimization technique.

The math is fairly simple, and it seems like the sort of thing that started working as our distributed systems for training got better. It's remarkable to me that a blog post from @_kevinlu at @thinkymachines is the canonical reference for using the reverse KL divergence as an advantage within policy-gradient tools. This switch to distillation objectives within RL setups enables a lot of fun reward shaping ideas.
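To make the "reverse KL as an advantage" idea concrete, here is a minimal NumPy sketch. It is not taken from the referenced blog post or from any library; the function names (`reverse_kl_advantages`, `policy_gradient_loss`) are hypothetical, and it assumes a single sampled sequence with per-token logits from both models. The assumption is that, for tokens sampled from the student, the per-token advantage is the teacher log-probability minus the student log-probability, which is a one-sample estimate of the negative reverse KL.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def reverse_kl_advantages(student_logits, teacher_logits, sampled_tokens):
    # Hypothetical helper: tokens are assumed to be sampled from the student.
    # E_student[log p_student - log p_teacher] is the reverse KL, so its
    # negative at each sampled token serves as a per-token advantage.
    idx = np.arange(len(sampled_tokens))
    student_logp = log_softmax(student_logits)[idx, sampled_tokens]
    teacher_logp = log_softmax(teacher_logits)[idx, sampled_tokens]
    return teacher_logp - student_logp

def policy_gradient_loss(student_logits, advantages, sampled_tokens):
    # REINFORCE-style surrogate: -mean(advantage * log p_student(token)).
    # In an autodiff framework the advantages would be treated as constants.
    idx = np.arange(len(sampled_tokens))
    student_logp = log_softmax(student_logits)[idx, sampled_tokens]
    return -(advantages * student_logp).mean()

# Toy example: 2 token positions, vocabulary of 3.
student = np.array([[1.0, 2.0, 0.5], [0.1, 0.1, 0.1]])
teacher = student.copy()
teacher[0, 1] += 3.0  # the teacher likes token 1 at position 0 more than the student does
tokens = np.array([1, 0])

adv = reverse_kl_advantages(student, teacher, tokens)
loss = policy_gradient_loss(student, adv, tokens)
```

In this toy case the advantage at position 0 is positive, because the teacher assigns the sampled token more probability than the student does, and it is zero at position 1, where the two models agree. Because the signal is just a per-token scalar, it can be fed to an existing policy-gradient implementation in place of a reward-model advantage, or combined with other reward terms.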

This also means that the proliferation of on-policy distillation was obviously helped by the mass engineering effort in getting RL algorithms right over the last few years.

Lastly, as someone already very familiar with @agarwl_'s early work on generalized knowledge distillation and its connection to imitation learning algorithms like DAgger, I recommend reading the concurrent work MiniLLM, which was technically the first to propose using a policy-gradient-like, on-policy rollout approach for distillation.

The switch from learning from teacher demonstrations to learning from student rollouts seems so obvious in hindsight, given where we are with RL hype, but at the time it obviously took a bunch of work to get right.

Excited to figure out how to make post-training recipes around this!