On-policy distillation is on track to be a lasting method in post-training. The list of areas would be:
Instruction tuning (SFT/IFT)
RLHF
Direct Preference Optimization (DPO et al.)
RLVR
On-policy Distillation (OPD)
New classes of methods are rare! Excited to play.