# Nvidia joins the multi-teacher on-policy distillation (MODP) camp; the post-training standard is now established

- Source: Nathan Lambert (@natolambert)
- Published: 2026-06-04 21:36
- AIHOT score: 60
- AIHOT link: https://aihot.virxact.com/items/cmpzjruya046aslkpybfg5jlj
- Original link: https://x.com/natolambert/status/2062528878997029030

## AI Summary

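Nvidia has adopted multi-teacher on-policy distillation (MODP) as the core of its post-training, which marks the paradigm as the industry standard. Its pipeline has been redesigned: SFT first, then multi-environment RLVR across multi-agent, reasoning, code, and safety environments, and finally distillation from 10+ domain-specialist teachers, which give dense token-level guidance on the student model's own generated outputs. The earlier route, multi-teacher SFT followed by RL, was the standard established by DeepSeek R1, and Microsoft used it in its first model.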
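The final stage described above can be sketched in a few lines. This is a toy illustration only, not Nvidia's implementation: the source does not state the loss, so the sketch assumes a per-token reverse KL, KL(student || teacher), which is one common choice for on-policy distillation. It also assumes each student-generated sample is routed to a single teacher by a domain label. The names `reverse_kl`, `multi_teacher_loss`, `samples`, and `teachers` are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token position."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one token position (assumed loss)."""
    lp = log_softmax(student_logits)
    lq = log_softmax(teacher_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def multi_teacher_loss(samples, teachers):
    """Mean token-level loss over student-generated samples.

    samples:  list of dicts with 'domain', 'tokens' (sampled by the
              student itself, hence on-policy) and 'student_logits'
              (one logit vector per token position).
    teachers: dict mapping a domain name to a function that scores the
              student's tokens and returns per-position logits.
    """
    losses = []
    for s in samples:
        # Hypothetical routing: one domain-specialist teacher per sample.
        teacher_logits = teachers[s["domain"]](s["tokens"])
        per_token = [reverse_kl(a, b)
                     for a, b in zip(s["student_logits"], teacher_logits)]
        losses.append(sum(per_token) / len(per_token))
    return sum(losses) / len(losses)

# Toy usage: a 3-token vocabulary and two stand-in "teachers".
teachers = {
    "code": lambda tokens: [[2.0, 0.0, 0.0] for _ in tokens],
    "safety": lambda tokens: [[0.0, 0.0, 2.0] for _ in tokens],
}
samples = [
    {"domain": "code", "tokens": [0, 1],
     "student_logits": [[1.0, 0.5, 0.0], [0.0, 0.0, 0.0]]},
    {"domain": "safety", "tokens": [2],
     "student_logits": [[0.0, 0.0, 2.0]]},
]
print(multi_teacher_loss(samples, teachers))
```

The signal is dense because every token position contributes a loss term, in contrast to a single scalar reward per rollout in RLVR. In a real pipeline the logits would come from model forward passes and the loss would be backpropagated through the student only.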

## Main text

Nvidia joined the multi-teacher, on-policy distillation (MODP) gang! Is industry standard post-training right now.

The multi-teacher SFT to RL that Microsoft did in their first model was the standard established by DeepSeek R1. I expect MAI 2 to be MODP.

### Quoted tweet

> Oleksii Kuchaiev: Our post-training pipeline is a substantial redesign from Super. The core idea: don't rely on stacked RL stages alone. We do SFT, multi-environment RLVR across ...