KPop: A New Technique for Stabilizing Reinforcement Learning Training of Large-Scale MoE Models

Ant Ling @AntLingAGI

2026-05-26 23:32 · 25 days ago

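AI Summary

The team released KPop, a technique for stabilizing reinforcement learning (RL) training of large-scale MoE models. It replaces the fixed-ratio mask of the earlier IcePop method with an adaptive binary-KL-divergence region that matches each token's inherent noise, giving more robust parameter updates and supporting long-horizon, agentic RL training. In practice, the trillion-parameter Ring-2.6-1T model scores above 76 on SWE-bench Verified using pure RL training alone, with no changes to infrastructure and no routing replay. KPop achieves this with only one key parameter.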

From IcePop to KPop - our team keeps pushing on RL training stability for large MoE models. 👇

KPop replaces the fixed-ratio mask with an adaptive binary-KL region that matches each token's inherent noise. More robust updates, stable long-horizon agentic RL.

Ring-2.6-1T → 76+ on SWE-bench Verified, pure RL.

Congrats to @Jia__Guo & team!

Blog: https://ringtech.notion.site/kpop
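
The post contrasts a fixed-ratio mask with a binary-KL region but does not give formulas, so the following is only a minimal sketch of one plausible reading. It is not the team's implementation. It assumes that each token has a probability under the training engine (`p_train`) and under the inference engine (`p_infer`), that the fixed-ratio mask keeps a token when `p_train / p_infer` lies in a fixed band, and that the binary-KL region keeps a token when the KL divergence between the two Bernoulli distributions stays under a single threshold. The function names, the KL direction, and the values of `low`, `high`, and `delta` are all hypothetical.

```python
import math


def binary_kl(p: float, q: float, eps: float = 1e-12) -> float:
    """KL(Bernoulli(p) || Bernoulli(q)) in nats, with clamping for stability."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def fixed_ratio_mask(p_train: float, p_infer: float,
                     low: float = 0.5, high: float = 2.0) -> bool:
    """Assumed fixed-ratio rule: keep the token iff the probability ratio
    falls inside one band [low, high] that is the same for every token."""
    ratio = p_train / p_infer
    return low <= ratio <= high


def binary_kl_mask(p_train: float, p_infer: float, delta: float = 0.05) -> bool:
    """Assumed binary-KL rule: keep the token iff the Bernoulli KL between the
    inference-side and training-side probabilities is at most delta.
    The tolerated ratio then depends on how probable the token is."""
    return binary_kl(p_infer, p_train) <= delta


# Illustrative tokens (made-up probabilities, not measured data).
low_prob_token = dict(p_train=0.03, p_infer=0.01)   # ratio of about 3
high_prob_token = dict(p_train=0.60, p_infer=0.98)  # ratio of about 0.61

for name, tok in [("low-prob", low_prob_token), ("high-prob", high_prob_token)]:
    print(name,
          "ratio_keep =", fixed_ratio_mask(**tok),
          "kl_keep =", binary_kl_mask(**tok))
```

With these illustrative thresholds, the low-probability token (0.01 to 0.03) is dropped by the ratio band but kept by the binary-KL test, while the high-probability token (0.98 to 0.60) is kept by the ratio band but rejected by the binary-KL test. This is one way a single threshold could adapt the accepted region to each token's probability level. The blog linked above is the place to check the actual definition.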

Jia Guo: Curious about the secret sauce behind our trillion-scale agentic foundation model? Here it comes! 🥳 Last year, we released IcePop to stabilize MoE RL with doubl...

