Nathan Lambert (@natolambert) · X

2026-05-19 07:00 · 45 days ago

On-policy distillation is on track to be a lasting method in post-training. The list of areas would be:

- Instruction tuning (SFT/IFT)
- RLHF
- Direct Preference Optimization (DPO et al.)
- RLVR
- On-policy Distillation (OPD)

New classes of methods are rare! Excited to play.
