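Nathan Lambert comments that this video is exactly what he would have wanted while writing his book, and says self-distillation work appears to be impactful at the frontier. Dwarkesh Patel recorded Sasha Rush's impromptu explanation: when a model makes a mistake during a rollout (for example, calling a tool that does not exist), there is no need to learn from the final reward over the whole trajectory, which is a noisy signal. Instead, another model reads the trajectory and locates the mistake, hint tokens are inserted above the mistake, and the original model runs one more forward pass. The hint makes the model assign lower probability to the mistaken token, and the original model is then trained to match these new probabilities. The whole process requires no regenerated rollout, so there is no extra decoding cost.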
Great little video on modern on-policy distillation in post-training recipes.
Wish I had this when writing the section on distillation for my book. And where I've been bearish on a lot of the academic work for self-distillation, it seems impactful at the frontier.
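
The hint-based procedure attributed to Sasha Rush above can be sketched on a toy problem. This is only an illustration under loud assumptions, not the recipe from the video: the "model" is a three-entry bias vector over a made-up vocabulary, the hint's effect is hard-coded as a fixed logit penalty, and the names `VOCAB`, `HINT`, `forward`, and `self_distill` are all hypothetical. What it does show is the shape of the update: one extra forward pass with the hint in context yields target probabilities, and the hint-free model is trained toward them with a KL loss, without sampling a new rollout.

```python
import math

# Hypothetical toy vocabulary: the middle token is the "mistake".
VOCAB = ["call_search", "call_fake_tool", "answer"]
HINT = "<hint: fake_tool does not exist>"  # assumed hint text, not from the source


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(bias, context):
    """Toy stand-in for a forward pass: next-token probabilities.

    Assumption: seeing the hint in context lowers the logit of the bad
    tool call by a fixed amount. A real model would learn this behavior.
    """
    logits = list(bias)
    if HINT in context:
        logits[VOCAB.index("call_fake_tool")] -= 4.0
    return softmax(logits)


def kl(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def self_distill(bias, prefix, steps=20, lr=0.5):
    """Train the hint-free model toward its own hinted probabilities."""
    # One forward pass with the hint placed just above the mistake position.
    # The result is a fixed target; no new rollout is decoded.
    teacher = forward(bias, prefix + [HINT])
    for _ in range(steps):
        student = forward(bias, prefix)
        # Gradient of KL(teacher || student) w.r.t. the logits is student - teacher.
        bias = [b - lr * (s - t) for b, s, t in zip(bias, student, teacher)]
    return bias, teacher


if __name__ == "__main__":
    prefix = ["user: look up the weather"]
    bias0 = [0.0, 1.0, 0.0]  # initially prefers the nonexistent tool
    bias1, teacher = self_distill(bias0, prefix)
    bad = VOCAB.index("call_fake_tool")
    print("p(mistake) before:", forward(bias0, prefix)[bad])
    print("p(mistake) after: ", forward(bias1, prefix)[bad])
```

In this sketch the target distribution is computed once from the initial weights and held fixed, which is one reasonable reading of "train the original model to match these new probabilities"; whether a real pipeline refreshes the hinted targets during training is not stated in the post.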