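SOAR 2026, co-hosted by OpenBMB, SGLang, and NVIDIA, has concluded. The challenge aimed to maximize the inference performance of MiniCPM-SALA (a sparse+linear hybrid attention model) on a single consumer GPU. In total, 326 teams registered, more than 4,300 submissions were made, and 69 teams made it onto the leaderboard. The winning team achieved an overall 6.33× speedup, with a peak single-request inference speedup of 9.72×. Their solution combined NVFP4 quantization, FlashInfer plan-cache optimization, custom Triton kernels, EAGLE-3 speculative decoding, and runtime-aware scheduling. Low-bit quantization, speculative decoding, sparse attention, and phase-aware scheduling are seen as the core pillars of next-generation efficient inference.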
SOAR 2026 has officially wrapped up! 🎉
Hosted by @OpenBMB, @SGLang, and @NVIDIA, the challenge tasked developers worldwide with maximizing the inference performance of MiniCPM-SALA, our sparse+linear hybrid attention model, on a single consumer GPU.
On June 6, we brought the SOAR 2026 community together in Beijing for our final in-person meetup. Developers, researchers, and open-source builders from @NVIDIA, @SGLang, and @OpenBMB gathered to share hard-won lessons from the front lines of inference optimization. From Blackwell architecture tuning to SGLang-Omni and the Densing Law, it was a powerful reminder that inference efficiency is a full-stack, cross-community effort. ☺️