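Summary: The author added to his book a historical review of how on-policy distillation became a core post-training optimization technique. The math is relatively simple, and its development benefited from advances in distributed training systems. The key turning point was adopting distillation objectives within reinforcement learning setups, which opened up many reward shaping ideas. The spread of on-policy distillation also owes much to the large-scale engineering investment in RL algorithms over recent years. The shift from learning on teacher demonstrations to learning on the student's own rollouts looks obvious in hindsight, but took a great deal of work. Related research such as MiniLLM was the first to propose a policy-gradient-like, on-policy rollout method for distillation.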
Added a 1,500-word mini history to my book on the path to on-policy distillation becoming a core post-training optimization technique.
The math is fairly simple, and it seems like the sort of thing that started working as our distributed systems for training got better. It's remarkable to me that a blog post from @_kevinlu at @thinkymachines is the canonical reference for using the reverse KL divergence as an advantage within policy-gradient tools. This switch to distillation objectives within RL setups enables a lot of fun reward shaping ideas.
This also means that the proliferation of on-policy distillation was clearly helped by the mass engineering effort that went into getting RL algorithms right over the last few years.
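To make the "reverse KL as an advantage" idea concrete, here is a minimal sketch in plain Python. It assumes the common per-token form: the student samples a token, and the advantage for that token is the negative of a single-sample reverse KL estimate, i.e. `-(log p_student - log p_teacher)`. The function names (`reverse_kl_advantages`, `policy_gradient_loss`) and the toy logits are hypothetical, made up for illustration, and are not taken from the blog post or from any library.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl_advantages(student_logits, teacher_logits, sampled_tokens):
    """Per-token advantage from a single-sample reverse KL estimate.

    Assumption (not from the source): for each token x_t sampled from the
    student, the advantage is -(log pi_student(x_t) - log pi_teacher(x_t)).
    """
    advantages = []
    for s_row, t_row, tok in zip(student_logits, teacher_logits, sampled_tokens):
        student_lp = log_softmax(s_row)[tok]
        teacher_lp = log_softmax(t_row)[tok]
        advantages.append(-(student_lp - teacher_lp))
    return advantages


def policy_gradient_loss(student_logits, sampled_tokens, advantages):
    """REINFORCE-style surrogate loss; advantages are treated as constants."""
    total = 0.0
    for row, tok, adv in zip(student_logits, sampled_tokens, advantages):
        total += -adv * log_softmax(row)[tok]
    return total / len(sampled_tokens)


# Toy rollout: two positions, vocabulary of size two (hypothetical numbers).
student = [[0.0, 0.0], [0.0, 0.0]]
teacher = [[2.0, 0.0], [2.0, 0.0]]
tokens = [0, 1]  # tokens sampled from the student policy

adv = reverse_kl_advantages(student, teacher, tokens)
loss = policy_gradient_loss(student, tokens, adv)
```

In this toy case the teacher prefers token 0, so the position where the student sampled token 0 gets a positive advantage and the position where it sampled token 1 gets a negative one. In a real setup the same advantages would be handed to an existing policy-gradient trainer, which is where the reuse of RL infrastructure comes in.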
Lastly, as someone already very familiar with @agarwl_'s early work on generalized knowledge distillation and its connection to imitation learning algorithms like DAgger, I recommend reading the concurrent work MiniLLM, which was technically the first to propose a policy-gradient-like, on-policy rollout approach to distillation.