OpenBMB @OpenBMB · June 18
SOAR 2026 has officially wrapped up! 🎉
Hosted by @OpenBMB, @SGLang, and @NVIDIA, the challenge tasked developers worldwide with maximizing the inference performance of MiniCPM-SALA — our sparse+linear hybrid attention model — on a single consumer GPU.
On June 6, we brought the SOAR 2026 community together in Beijing for our final in-person Meetup. Developers, researchers, and open-source builders from @NVIDIA, @SGLang, and @OpenBMB gathered to share hard-won lessons from the frontlines of inference optimization. From Blackwell architecture tuning to SGLang-Omni and the Densing Law, it was a powerful reminder that inference efficiency is a full-stack, cross-community effort. ☺️
Huge thanks to our co-hosts @SGLang and @NVIDIA for making this possible — and to every participant who submitted, iterated, and shared. 😘
Final Metrics:
📊 326 teams registered, 370 participants
📊 4,300+ total submissions
📊 69 teams on the final leaderboard
🏆 The winning team achieved an overall 6.33x speedup over baseline — peaking at 9.72x on single-request inference. Their solution combined:
🔹 NVFP4 quantization with hybrid GEMM dispatch
🔹 FlashInfer plan-cache optimization
🔹 Custom Triton kernels for GLA layers
🔹 EAGLE-3 speculative decoding with dynamic depth switching
🔹 Runtime-aware scheduling across different concurrency levels
Low-bit quantization, speculative decoding, sparse attention, and phase-aware scheduling are emerging as the core pillars of next-gen efficient inference. SOAR 2026 put that thesis to the test — and the community delivered.
The leaderboard is closed, but the optimizations, code, and conversations will live on in the open-source ecosystem. 🚀
🔗 MiniCPM-SALA: http://huggingface.co/openbmb/MiniCPM-SALA