Lilian Weng (@lilianweng)

2025-10-28 01:31 · 248 days ago

AI summary

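On-policy distillation offers an elegant way to use the teacher model as a process reward model that provides dense reward, while preventing SFT-style "OOD shock" during rollout. [Quoting @thinkymachines]: Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. When using it for math reasoning and for training an internal chat assistant, we found that on-policy distillation can outperform other approaches at a fraction of the cost. https://thinkingmachines.ai/blog/on-policy-distillation/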

On-policy distillation provides an elegant way to use the teacher model as a process reward model to provide dense reward while preventing SFT style "OOD shock" during rollout.
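The idea of using the teacher as a dense, per-token reward on the student's own rollouts can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the implementation from the linked post: it assumes the reward for each sampled token is the teacher's log-probability minus the student's log-probability (a single-sample estimate of the negative reverse KL), and the names `log_softmax` and `per_token_rewards` are hypothetical helpers introduced only for this example.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def per_token_rewards(student_logits, teacher_logits, sampled_tokens):
    """Score each token of a student-sampled rollout with the teacher.

    Assumption (not confirmed by the source): the reward at each position is
    log p_teacher(token) - log p_student(token), so every token in the
    rollout receives its own signal rather than one reward per sequence.
    """
    rewards = []
    for s_row, t_row, tok in zip(student_logits, teacher_logits, sampled_tokens):
        student_logp = log_softmax(s_row)[tok]
        teacher_logp = log_softmax(t_row)[tok]
        rewards.append(teacher_logp - student_logp)
    return rewards


# Toy rollout over a 2-token vocabulary: the student sampled token 0 twice.
student = [[2.0, 0.0], [1.0, 1.0]]
teacher = [[0.0, 2.0], [1.0, 1.0]]  # teacher disagrees at step 0, agrees at step 1
rewards = per_token_rewards(student, teacher, [0, 0])
```

In this toy case the first token is penalized because the teacher finds it unlikely, while the second token gets zero reward because the two distributions match. Since the tokens are sampled from the student rather than copied from a teacher-written dataset, the signal is computed on states the student actually visits, which is the property the post contrasts with SFT.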

Thinking Machines: Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. When train...

Reasoning · Data/Training · Papers/Research
