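Nvidia has adopted multi-teacher on-policy distillation (MODP) as the core of its post-training, a sign that the paradigm has become the industry standard. Its pipeline has been redesigned: SFT first, then multi-environment RLVR across multi-agent, reasoning, code, and safety environments, and finally distillation from 10+ domain-specialist teachers that give dense token-level guidance on the student model's own generated outputs. The earlier standard, multi-teacher SFT followed by RL, was established by DeepSeek R1 and is what Microsoft used in its first model.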
Nvidia joined the multi-teacher, on-policy distillation (MODP) gang! It is the industry-standard post-training approach right now.
The multi-teacher SFT-to-RL recipe that Microsoft used for their first model was the earlier standard, established by DeepSeek R1. I expect MAI 2 to be MODP.
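
For anyone wondering what "dense token-level guidance on the student's own outputs" could look like in practice, here is a rough sketch. It is not Nvidia's implementation. It assumes the per-token signal is a reverse KL, KL(student || teacher), computed at every position of a student-sampled sequence, and that each sample is routed to one teacher by a domain tag. The function names (`reverse_kl_per_token`, `multi_teacher_loss`) and the batch layout are made up for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl_per_token(student_logits, teacher_logits):
    """KL(student || teacher) at each position.

    Both inputs are shaped [seq_len][vocab]. Using reverse KL here is an
    assumption; the source only says "dense token-level guidance".
    """
    losses = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        s_logp = log_softmax(s_row)
        t_logp = log_softmax(t_row)
        losses.append(
            sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s_logp, t_logp))
        )
    return losses


def multi_teacher_loss(batch, teachers):
    """Average per-token loss over a batch of student-generated samples.

    batch:    list of dicts with hypothetical keys
              'domain', 'tokens' (sampled by the student), 'student_logits'.
    teachers: dict mapping domain -> callable(tokens) -> teacher logits.
    """
    total, count = 0.0, 0
    for sample in batch:
        # Route the sample to the teacher that specializes in its domain.
        teacher_logits = teachers[sample["domain"]](sample["tokens"])
        per_token = reverse_kl_per_token(sample["student_logits"], teacher_logits)
        total += sum(per_token)
        count += len(per_token)
    return total / count
```

The point of the on-policy part is that `tokens` come from the student itself, so every position gets a teacher signal on the states the student actually visits, instead of one scalar reward per rollout as in RLVR.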