Ant Ling @AntLingAGI

2026-06-18 11:02 · 14 days ago

AI Summary

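Ant Ling and the SGLang team collaborated to deploy Ling-2.6-1T, a 1T-parameter hybrid MoE model, on TPU v7x via SGLang-JAX. Optimizations include: an upgraded Fused MoE V2 kernel (tokens and accumulators stay resident in VMEM, expert weights are double-buffered, and routing and prefetch are hidden behind compute); a hybrid memory pool (per-token MLA KV for the 10 full-attention layers plus per-request recurrent state for the 70 GLA layers); chunk-wise parallel prefill for GLA linear attention; and single-controller DP that keeps grouped RMSNorm chip-local. Results: MoE prefill latency drops by 53%, and on a 16-chip TPU v7x slice, decode throughput is up to 1.77x higher than on a comparable H200 cluster.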
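To make the hybrid memory pool in the summary concrete, here is a minimal accounting sketch. Only the layer counts (10 full-attention, 70 GLA) come from the summary; the byte sizes, the `HybridCacheConfig` class, and the `cache_bytes` helper are hypothetical placeholders for illustration, not SGLang-JAX APIs or published Ling-2.6-1T values. It shows why the two layer types are pooled differently: the MLA KV cache grows with the number of tokens, while the GLA recurrent state is a fixed cost per request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HybridCacheConfig:
    # Layer counts are taken from the summary; every byte size below is an
    # assumed placeholder, not a real Ling-2.6-1T figure.
    full_attn_layers: int = 10
    gla_layers: int = 70
    kv_bytes_per_token_per_layer: int = 1152  # assumed MLA KV entry size
    state_bytes_per_request_per_layer: int = 4 * 1024 * 1024  # assumed GLA state size


def cache_bytes(cfg, tokens_per_request):
    """Return (kv_bytes, state_bytes) for a batch of requests.

    kv_bytes scales with the total token count (per-token pool);
    state_bytes scales only with the number of requests (per-request pool).
    """
    total_tokens = sum(tokens_per_request)
    kv = cfg.full_attn_layers * cfg.kv_bytes_per_token_per_layer * total_tokens
    state = (
        cfg.gla_layers
        * cfg.state_bytes_per_request_per_layer
        * len(tokens_per_request)
    )
    return kv, state


cfg = HybridCacheConfig()
short_batch = cache_bytes(cfg, [1_000, 1_000])      # two short requests
long_batch = cache_bytes(cfg, [100_000, 100_000])   # two long requests
```

Under these assumed sizes, lengthening both requests by 100x multiplies the KV portion by 100 but leaves the recurrent-state portion unchanged, which is the property a per-request pool can exploit.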

It has been a privilege to collaborate so closely with the SGLang team @lmsysorg on optimizing Ling-2.6-1T. 🥳 The resulting performance gains speak for themselves:
- 53% reduction in MoE pre-fill latency
- Up to 1.77x higher decode throughput on a 16-chip TPU v7x slice compared to a similar H200 cluster
A significant milestone in efficient MoE scaling and hardware utilization!

LMSYS Org: 🚀 Our new blog: Optimizing Ling-2.6-1T on TPU with SGLang-JAX: Hiding MoE Data Movement Behind Compute with One Pallas Kernel Ling-2.6-1T, a 1T hybrid MoE mode...

Inference · Papers/Research · Deployment/Engineering
