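Ant Group's Ling (百灵) team worked with the SGLang team to deploy Ling-2.6-1T, a 1T-parameter hybrid MoE model, on TPU v7x via SGLang-JAX. The optimizations include: an upgraded Fused MoE V2 kernel (tokens and accumulators stay resident in VMEM, expert weights are double-buffered, and routing and prefetch latency is hidden); a hybrid memory pool (per-token MLA KV for the 10 full-attention layers plus per-request recurrent state for the 70 GLA layers); chunk-wise parallel prefill for GLA linear attention; and single-controller DP that keeps grouped RMSNorm chip-local. Results: MoE prefill latency drops by 53%, and on a 16-chip TPU v7x slice, decode throughput is up to 1.77x higher than on a comparable H200 cluster.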
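To make the hybrid memory pool idea concrete, here is a rough accounting sketch, not the SGLang-JAX implementation. Only the layer counts (10 full-attention, 70 GLA) come from the summary above; the class name, function names, and every size (`mla_kv_dim`, `gla_state_elems`, `bytes_per_elem`) are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HybridPoolConfig:
    # Layer counts are taken from the post; all sizes below are hypothetical.
    num_full_attn_layers: int = 10
    num_gla_layers: int = 70
    mla_kv_dim: int = 512        # hypothetical: KV elements per token per layer
    gla_state_elems: int = 4096  # hypothetical: state elements per request per layer
    bytes_per_elem: int = 2      # hypothetical: 2-byte storage


def kv_bytes(cfg: HybridPoolConfig, num_tokens: int) -> int:
    """Per-token part: grows linearly with the number of cached tokens."""
    return cfg.num_full_attn_layers * num_tokens * cfg.mla_kv_dim * cfg.bytes_per_elem


def state_bytes(cfg: HybridPoolConfig, num_requests: int) -> int:
    """Per-request part: fixed size per request, independent of sequence length."""
    return cfg.num_gla_layers * num_requests * cfg.gla_state_elems * cfg.bytes_per_elem


def pool_bytes(cfg: HybridPoolConfig, request_lengths: list[int]) -> int:
    """Total footprint for a batch, given each request's token count."""
    return kv_bytes(cfg, sum(request_lengths)) + state_bytes(cfg, len(request_lengths))
```

Under these assumptions, doubling a request's length doubles only the KV part from the 10 full-attention layers, while the recurrent state for the 70 GLA layers stays constant per request. That is why the two kinds of layers are sized and allocated differently.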
It has been a privilege to collaborate so closely with the SGLang team @lmsysorg on optimizing Ling-2.6-1T. 🥳

The resulting performance gains speak for themselves:
- 53% reduction in MoE pre-fill latency
- Up to 1.77x higher decode throughput on a 16-chip TPU v7x slice compared to a similar H200 cluster

A significant milestone in efficient MoE scaling and hardware utilization!