Ling-2.6-1T TPU Inference Optimization: Hiding MoE Data Movement with a Pallas Kernel

2026-06-24 15:01 · 百灵大模型

AI Summary

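Ant Group's ASystem Core team and the SGLang-JAX team optimized inference performance on TPU v7x for Ling-2.6-1T, a 1T-parameter sparse MoE model. The core of the work is the Fused MoE V2 Pallas kernel, which merges scatter, expert FFN, and gather into one kernel and lowers latency by overlapping computation with data movement. Compared with V1, MoE prefill latency drops from 5.16 ms to 2.42 ms (a 53% reduction), and decode kernel latency drops from 0.249 ms to 0.211 ms. Replacing only the MoE kernel raises prefill throughput by 24.8% and decode throughput by 18.5%–35.3%. Under the SGLang decode benchmark, the output throughput of 16 TPU v7x chips reaches 1.29x–1.77x that of 16 H200 GPUs. The work also fully supports the hybrid backbone, including hybrid KV/recurrent memory pools, GLA linear attention, and single-controller data parallelism.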
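The summary names three stages that the fused kernel combines: scatter, expert FFN, and gather. To make those stages concrete, here is a minimal unfused reference in plain JAX. It is only a sketch of the general top-k MoE pattern, not the team's Pallas kernel: the function name `moe_reference`, the two-matrix SiLU FFN, the softmax router with renormalized gates, and all shapes are assumptions for illustration. Each stage below materializes its full result before the next one starts, which is the kind of separate data movement a fused kernel would aim to overlap with compute.

```python
import jax
import jax.numpy as jnp


def moe_reference(x, router_logits, w_up, w_down, top_k=2):
    """Unfused top-k MoE layer (illustrative, hypothetical shapes).

    x:             (num_tokens, d_model)
    router_logits: (num_tokens, num_experts)
    w_up:          (num_experts, d_model, d_ff)
    w_down:        (num_experts, d_ff, d_model)
    """
    num_tokens, d_model = x.shape

    # Routing: choose top_k experts per token and renormalize their gates.
    probs = jax.nn.softmax(router_logits, axis=-1)
    gate_vals, expert_ids = jax.lax.top_k(probs, top_k)
    gate_vals = gate_vals / gate_vals.sum(axis=-1, keepdims=True)

    # Scatter: order the (token, expert) pairs by expert id so that rows
    # belonging to the same expert are contiguous.
    flat_expert = expert_ids.reshape(-1)                      # (T * k,)
    flat_token = jnp.repeat(jnp.arange(num_tokens), top_k)    # (T * k,)
    order = jnp.argsort(flat_expert)
    x_sorted = x[flat_token[order]]                           # (T * k, d_model)
    expert_sorted = flat_expert[order]

    # Expert FFN: each row uses the weights of its assigned expert.
    # Indexing w_up per row is memory hungry; acceptable only for a toy check.
    h = jax.nn.silu(jnp.einsum("td,tdf->tf", x_sorted, w_up[expert_sorted]))
    y_sorted = jnp.einsum("tf,tfd->td", h, w_down[expert_sorted])

    # Gather: undo the sort, then combine the top_k expert outputs per token.
    y_flat = jnp.zeros_like(y_sorted).at[order].set(y_sorted)
    y_pairs = y_flat.reshape(num_tokens, top_k, d_model)
    return (y_pairs * gate_vals[..., None]).sum(axis=1)


if __name__ == "__main__":
    k1, k2, k3, k4 = jax.random.split(jax.random.PRNGKey(0), 4)
    x = jax.random.normal(k1, (8, 4))
    logits = jax.random.normal(k2, (8, 3))
    w_up = jax.random.normal(k3, (3, 4, 6))
    w_down = jax.random.normal(k4, (3, 6, 4))
    print(moe_reference(x, logits, w_up, w_down).shape)
```

A reference like this is mainly useful as a correctness oracle: a fused implementation can be compared against it on small random inputs before any latency is measured.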

The full article can only be read inside WeChat; this site provides only the summary.

WeChat official account: 蚂蚁百灵 (Ling)

