团队发布了KPop技术,用于稳定大规模MoE模型的强化学习训练。它取代了此前IcePop方法的固定比例掩码,改用自适应二元KL散度区域来匹配每个token的固有噪声,从而实现更鲁棒的参数更新,支持长期、智能体化的强化学习训练。具体应用中,万亿参数的Ring-2.6-1T模型在仅使用纯强化学习训练(未修改基础设施或路由重放)的情况下,于SWE-bench Verified评测中得分超过76。KPop仅通过一个关键参数即可实现该优化。
From IcePop to KPop - our team keeps pushing on RL training stability for large MoE models. 👇
KPop replaces the fixed-ratio mask with an adaptive binary-KL region that matches each token's inherent noise. More robust updates, stable long-horizon agentic RL.
Ring-2.6-1T → 76+ on SWE-bench Verified, pure RL.
Congrats to @Jia__Guo & team!
Blog: https://ringtech.notion.site/kpop