Nathan Lambert @natolambert

2026-06-04 10:39 · 29 days ago

AI summary

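Nathan Lambert comments that this video is exactly what he needed when writing his book, and says self-distillation appears impactful at the frontier. Dwarkesh Patel recorded an impromptu explanation by Sasha Rush: when a model makes a mistake during a rollout (for example, calling a tool that does not exist), there is no need to learn from the final reward of the whole trajectory, which is a noisy signal. Instead, another model reads the trajectory and locates the error, hint tokens are inserted just before the error, and the original model runs one more forward pass. With the hint in context, the model assigns a lower probability to the erroneous token, and the original model is then trained to match these new probabilities. The whole process requires no new rollout generation, so there is no extra decoding cost.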
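The steps in the summary can be sketched as a toy in plain Python. This is a minimal illustration under loud assumptions, not the method from the video: the "model" is a hypothetical bag-of-context logit table `W`, the reviewer model is replaced by a rule-based stand-in `locate_error`, the `<hint>` token's effect is hard-coded into `W`, and only the single error position is distilled. All names are invented for the example.

```python
import math

# Toy vocabulary; every token name here is a hypothetical stand-in.
VOCAB = ["<bos>", "call", "search_tool", "fake_tool", "<hint>"]
IDX = {tok: i for i, tok in enumerate(VOCAB)}
KNOWN_TOOLS = {"search_tool"}


def make_weights():
    """W[c][j] is the logit that context token c contributes to next token j."""
    W = [[0.0] * len(VOCAB) for _ in VOCAB]
    W[IDX["call"]][IDX["search_tool"]] = 1.0
    W[IDX["call"]][IDX["fake_tool"]] = 2.0     # the toy model prefers the bad tool
    W[IDX["<hint>"]][IDX["fake_tool"]] = -4.0  # assumed effect of reading the hint
    return W


def next_token_probs(W, context):
    """Softmax over summed per-context-token logits (a stand-in for an LM forward pass)."""
    logits = [sum(W[IDX[c]][j] for c in context) for j in range(len(VOCAB))]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def locate_error(rollout):
    """Stand-in for the reviewer model: index of the first unknown tool call."""
    for i in range(1, len(rollout)):
        if rollout[i - 1] == "call" and rollout[i] not in KNOWN_TOOLS:
            return i
    return None


def run_demo(steps=300, lr=0.5):
    W = make_weights()
    rollout = ["<bos>", "call", "fake_tool"]  # already sampled; never re-decoded
    pos = locate_error(rollout)
    plain_ctx = rollout[:pos]
    hinted_ctx = plain_ctx + ["<hint>"]  # hint inserted just before the error

    # One extra forward pass with the hint gives the frozen target distribution.
    teacher = next_token_probs(W, hinted_ctx)
    bad = IDX["fake_tool"]
    p_before = next_token_probs(W, plain_ctx)[bad]

    for _ in range(steps):
        student = next_token_probs(W, plain_ctx)
        # Gradient of cross-entropy(teacher, student) w.r.t. the logits is
        # (student - teacher); each context token's row receives that gradient.
        for c in plain_ctx:
            for j in range(len(VOCAB)):
                W[IDX[c]][j] -= lr * (student[j] - teacher[j])

    p_after = next_token_probs(W, plain_ctx)[bad]
    return p_before, teacher[bad], p_after


if __name__ == "__main__":
    before, target, after = run_demo()
    print(f"P(fake_tool) before={before:.3f} target={target:.3f} after={after:.3f}")
```

The point of the sketch is that the target distribution comes from a forward pass over the existing rollout with the hint in context, so nothing is sampled again. In a real language model the hint's effect would come from the model reading the hint text, not from a hand-set weight.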

Great little video on modern on-policy distillation in post-training recipes.

Wish I had this when writing the section on distillation for my book. And where I've been bearish on a lot of the academic work for self-distillation, it seems impactful at the frontier.

Dwarkesh Patel: Recently met @srush_nlp and he started giving me an impromptu lecture on how targeted on-policy self-distillation works. I asked him if I could record it on my ...
