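Nathan Lambert has added 7.4 hours of lecture video for his new book, with content ranging from Hinton's 2015 knowledge distillation paper to today's multi-teacher on-policy distillation (OPD, MOPD, OPSD). The video focuses on the 3–4 core changes to the formulation needed to fit on-policy distillation into mainstream RL frameworks, and reviews how synthetic data gradually took over post-training data research. It also covers mainstream methods such as Constitutional AI, AI feedback, and rubrics as rewards. Timeline: 00:00 the rise of synthetic data, 10:50 teacher-student distillation background, 24:47 on-policy distillation, 37:11 Constitutional AI, 45:50 rubrics and conclusion.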
Something I should add -- on-policy distillation was the last content I got to sneak into the book before going to print.
It felt very important to have this method covered: it's growing rapidly and is used in distinct ways.
So you can also read what is covered in this lecture!