# On-policy distillation provides an elegant way to use the teacher model as a process reward model to…

- Source: Lilian Weng (@lilianweng)
- Published: 2025-10-28 01:31
- AIHOT link: https://aihot.virxact.com/items/cmnz6dpg002b6sl0f8703cafs
- Original link: https://x.com/lilianweng/status/1982862795961184572

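## AI Summary

On-policy distillation offers an elegant way to use the teacher model as a process reward model that provides dense rewards, while preventing SFT-style "OOD shock" during rollout.

[Quoting @thinkymachines]: Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. When applying it to math reasoning and to training an internal chat assistant, we find that on-policy distillation can outperform other approaches at a fraction of the cost.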

https://thinkingmachines.ai/blog/on-policy-distillation/

## Main Text

On-policy distillation provides an elegant way to use the teacher model as a process reward model to provide dense reward while preventing SFT style "OOD shock" during rollout.
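
The tweet does not spell out how the teacher's dense reward is computed. As a minimal sketch, assuming one common formulation, the student samples its own tokens (on-policy) and each sampled token is scored by the teacher's log-probability minus the student's, a per-token sampled estimate of the negative reverse KL. All function names and logits below are hypothetical, and the toy logits are fixed per position rather than conditioned on the sampled prefix as they would be in a real model.

```python
import math
import random


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sample(logits, rng):
    """Draw one token id from the distribution defined by the logits."""
    probs = [math.exp(lp) for lp in log_softmax(logits)]
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def per_token_rewards(student_logits, teacher_logits, tokens):
    """One reward per position: teacher log-prob minus student log-prob
    of the token the student actually sampled (assumed formulation)."""
    rewards = []
    for s_logits, t_logits, tok in zip(student_logits, teacher_logits, tokens):
        s_lp = log_softmax(s_logits)[tok]
        t_lp = log_softmax(t_logits)[tok]
        rewards.append(t_lp - s_lp)
    return rewards


# Toy setup: 3 positions, vocabulary of 4 tokens (hypothetical numbers).
rng = random.Random(0)
student_logits = [[2.0, 0.5, 0.1, -1.0], [0.0, 0.0, 0.0, 0.0], [1.0, 1.0, -2.0, 0.3]]
teacher_logits = [[2.0, 0.5, 0.1, -1.0], [3.0, 0.0, -1.0, -1.0], [-1.0, 2.5, -2.0, 0.3]]

# On-policy: the tokens come from the student, not from a fixed dataset.
tokens = [sample(logits, rng) for logits in student_logits]
rewards = per_token_rewards(student_logits, teacher_logits, tokens)
print(tokens, rewards)
```

Because every position receives its own score, the signal is dense in the way a process reward is, and because the scored tokens come from the student's own rollout, the student is graded on states it actually visits. At the first position the two toy distributions are identical, so the reward there is zero whichever token is sampled.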

### Quoted Tweet

> Thinking Machines: Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. When train...