# Sasha Rush Explains Modern On-Policy Distillation: Targeted Self-Distillation in Post-Training

- Source: Nathan Lambert (@natolambert)
- Published: 2026-06-04 10:39
- AIHOT score: 62
- AIHOT link: https://aihot.virxact.com/items/cmpyvzvwn03u7sli34k077z5c
- Original link: https://x.com/natolambert/status/2062363476711072124

## AI Summary

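Nathan Lambert says he wishes he had had this video when writing the distillation section of his book, and that although he has been bearish on much of the academic self-distillation work, it appears to be impactful at the frontier. Dwarkesh Patel recorded an impromptu lecture by Sasha Rush: when a model makes a mistake during a rollout (for example, calling a tool that does not exist), there is no need to learn from the final reward over the whole trajectory, which is a noisy signal. Instead, another model reads the trajectory and locates the error, hint tokens are inserted just above the error, and the original model runs one more forward pass. With the hint in context, the model assigns lower probability to the erroneous tokens, and the original model is then trained to match these new probabilities. The whole process requires no regenerated rollout, so there is no extra decoding cost.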
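The loop described in the summary can be sketched in a few lines of plain Python. This is only a toy illustration under loud assumptions, not the method as presented in the video: `toy_logits` is a hypothetical stand-in for a language model's forward pass, the vocabulary, the `HINT` string, and the fixed logit shift it causes are all invented, and the KL direction and learning rate are arbitrary choices. What it does show is the shape of the update: the same weights score the same position with and without the hint, and the hint-free model is nudged toward the hinted distribution without decoding anything new.

```python
import math

# Hypothetical three-action vocabulary; "call_fake_tool" plays the erroneous token.
VOCAB = ["call_search", "call_fake_tool", "answer"]
# Hypothetical hint text that a second model would write after reading the rollout.
HINT = "<hint: fake_tool does not exist>"


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_logits(context, bias):
    """Stand-in for one forward pass (an assumption, not a real model).

    `bias` is the only learnable parameter. Seeing the hint in context
    lowers the logit of the nonexistent tool by a fixed amount.
    """
    logits = list(bias)
    if HINT in context:
        logits[VOCAB.index("call_fake_tool")] -= 4.0
    return logits


def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distill_step(prefix, bias, lr=0.5):
    # Teacher: same weights, with the hint inserted just above the error position.
    teacher = softmax(toy_logits(prefix + [HINT], bias))
    # Student: same weights, original context without the hint.
    student = softmax(toy_logits(prefix, bias))
    loss = kl(teacher, student)
    # Gradient of KL(teacher || student) w.r.t. the student logits is
    # (student - teacher) when the teacher is treated as a constant target.
    new_bias = [b - lr * (s - t) for b, s, t in zip(bias, student, teacher)]
    return new_bias, loss


# The rollout prefix up to the error is reused as-is; nothing is re-decoded.
prefix = ["user: find the weather", "assistant:"]
bias = [0.0, 1.0, 0.0]  # the student initially prefers the nonexistent tool
bad = VOCAB.index("call_fake_tool")

p_before = softmax(toy_logits(prefix, bias))[bad]
for _ in range(20):
    bias, loss = distill_step(prefix, bias)
p_after = softmax(toy_logits(prefix, bias))[bad]

print(f"P(bad token) before: {p_before:.3f}, after: {p_after:.3f}")
```

In a real setup the two distributions would come from two forward passes of the actual model over the recorded rollout tokens, one with the hint in context and one without, and the loss would typically be summed over the positions at and after the located error.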

## Body

Great little video on modern on-policy distillation in post-training recipes.

Wish I had this when writing the section on distillation for my book. And where I've been bearish on a lot of the academic work for self-distillation, it seems impactful at the frontier.

### Quoted tweet

> Dwarkesh Patel: Recently met @srush_nlp and he started giving me an impromptu lecture on how targeted on-policy self-distillation works. I asked him if I could record it on my ...