MOPD: Multi-Teacher On-Policy Distillation for Capability Integration in LLM Post-Training

2026-06-29 08:00 · 4 days ago

AI Summary

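Reinforcement learning is widely used in LLM post-training to strengthen specific capabilities, but integrating several capabilities into one model is difficult. Existing methods such as Off-Policy Finetune and Mix-RL are either inefficient or degrade performance. MOPD proposes a new paradigm: first run dedicated RL training in each domain to obtain domain teachers, then distill those teachers into the student on the student's own rollouts, which eliminates exposure bias and provides a dense optimization signal. On Qwen3-30B-A3B, MOPD outperforms the Mix-RL, Cascade RL, Off-Policy Finetune, and Param-Merge baselines and inherits nearly all of each teacher's capability. MOPD lets domain teachers be developed in parallel and independently, removing cross-domain coupling, and has been deployed in the post-training of the industrial-scale model MiMo-V2-Flash.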

Original text · Not translated

Modern large language models (LLMs) rely on reinforcement learning during post-training to push specific capabilities, yet integrating multiple capabilities into one model remains hard. Existing methods, such as Off-Policy Finetune and Mix-RL, are either inefficient or lose performance. In this work, we propose Multi-teacher On-Policy Distillation (MOPD), a post-training paradigm for combining the capabilities of multiple domain RL teachers: we first run per-domain specialised RL to obtain a set of domain teachers, then distill these teachers into the student on its own rollouts. This eliminates exposure bias and provides a dense optimization signal. On Qwen3-30B-A3B, MOPD outperforms Mix-RL, Cascade RL, Off-Policy Finetune, and Param-Merge baselines, inheriting nearly all of each teacher's capability. MOPD also enables parallel, independent development of domain teachers, removing the cross-domain coupling typical of multi-domain post-training. MOPD has been deployed in the post-training of MiMo-V2-Flash, an industrial-scale frontier model, demonstrating its practical value for capability integration in frontier-scale LLMs.
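The abstract does not state the distillation objective, so the following is only a minimal sketch of what "distilling a domain teacher into the student on its own rollouts" could look like. It assumes a per-token reverse KL, KL(student || teacher), which is a common choice for on-policy distillation but is not confirmed by the source. All names here (`mopd_style_loss`, `toy_student`, `toy_math_teacher`, `toy_code_teacher`) and the three-token vocabulary are hypothetical stand-ins for real models.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) at one token position (assumed objective)."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
    )


def sample_rollout(student_logits_fn, length, rng):
    """Sample tokens from the student itself, so training data is on-policy."""
    tokens = []
    for _ in range(length):
        probs = softmax(student_logits_fn(tokens))
        tokens.append(rng.choices(range(len(probs)), weights=probs)[0])
    return tokens


def mopd_style_loss(student_logits_fn, teachers, domain, length, rng):
    """Mean per-token reverse KL against the teacher chosen for `domain`.

    Every position of the student's rollout contributes a term, which is
    the sense in which the signal is dense rather than one reward per sample.
    """
    teacher_logits_fn = teachers[domain]
    tokens = sample_rollout(student_logits_fn, length, rng)
    per_token = []
    for t in range(length):
        prefix = tokens[:t]
        p = softmax(student_logits_fn(prefix))
        q = softmax(teacher_logits_fn(prefix))
        per_token.append(reverse_kl(p, q))
    return sum(per_token) / length, per_token


# Hypothetical toy models: each maps a token prefix to logits over 3 tokens.
def toy_student(prefix):
    return [0.0, 0.0, 0.0]


def toy_math_teacher(prefix):
    return [2.0, 0.0, 0.0]  # disagrees with the student


def toy_code_teacher(prefix):
    return [0.0, 0.0, 0.0]  # already matches the student


teachers = {"math": toy_math_teacher, "code": toy_code_teacher}
rng = random.Random(0)
math_loss, math_terms = mopd_style_loss(toy_student, teachers, "math", 4, rng)
code_loss, code_terms = mopd_style_loss(toy_student, teachers, "code", 4, rng)
```

In this toy setup the loss against the matching teacher is zero and the loss against the disagreeing teacher is positive. A real pipeline would replace the toy functions with model forward passes and backpropagate the loss through the student only.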

HuggingFace Daily Papers (community trending papers)

Read the original · arxiv.org
