# Ling-2.6-1T TPU Inference Optimization: Hiding MoE Data Movement with a Pallas Kernel

- Source: WeChat official account 蚂蚁百灵 (Ling)
- Author: 百灵大模型 (Ling Large Model)
- Published: 2026-06-24 15:01
- AIHOT score: 49
- AIHOT link: https://aihot.virxact.com/items/cmqrqusie0mw4slp5pa0v42xv
- Original link: https://mp.weixin.qq.com/s/Ql7lU0d4uf5_f1MscFMSQg

## AI Summary

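Ant's ASystem Core and SGLang-JAX teams optimized inference performance on TPU v7x for Ling-2.6-1T, a 1T-parameter sparse MoE model. The core of the work is the Fused MoE V2 Pallas kernel, which merges scatter, expert FFN, and gather and lowers latency by overlapping computation with data movement. Compared with V1, MoE prefill latency drops from 5.16 ms to 2.42 ms (a 53% reduction), and decode kernel latency drops from 0.249 ms to 0.211 ms (about 15%). Replacing only the MoE kernel raises prefill throughput by 24.8% and decode throughput by 18.5%–35.3%. Under the SGLang decode benchmark, the output throughput of 16 TPU v7x chips reaches 1.29x–1.77x that of 16 H200 GPUs. The work also fully supports the hybrid backbone, including hybrid KV/recurrent memory pools, GLA linear attention, and single-controller data parallelism.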
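To make the three stages named in the summary concrete, here is a minimal, unfused reference sketch in plain Python. It is not the Fused MoE V2 Pallas kernel and does not model the overlap of computation with data movement. It only illustrates what scatter, expert FFN, and gather compute when run one after another. All function names, the top-k routing format, and the toy scalar "experts" are assumptions made for illustration.

```python
# Unfused reference for the three stages a fused MoE kernel combines.
# Everything here (names, routing format, scalar experts) is hypothetical.

def scatter(tokens, routes):
    """Group tokens by expert. routes[i] is a list of (expert_id, weight)."""
    buckets = {}
    for tok_idx, choices in enumerate(routes):
        for expert_id, weight in choices:
            buckets.setdefault(expert_id, []).append((tok_idx, weight, tokens[tok_idx]))
    return buckets

def run_experts(buckets, w_up, w_down):
    """Toy scalar FFN per expert: w_down * relu(w_up * x)."""
    results = {}
    for expert_id, entries in buckets.items():
        results[expert_id] = [
            (tok_idx, weight, w_down[expert_id] * max(0.0, w_up[expert_id] * x))
            for tok_idx, weight, x in entries
        ]
    return results

def gather(results, num_tokens):
    """Combine expert outputs back into token order, weighted by router score."""
    out = [0.0] * num_tokens
    for entries in results.values():
        for tok_idx, weight, y in entries:
            out[tok_idx] += weight * y
    return out

def moe_reference(tokens, routes, w_up, w_down):
    buckets = scatter(tokens, routes)               # stage 1: dispatch
    results = run_experts(buckets, w_up, w_down)    # stage 2: expert FFN
    return gather(results, len(tokens))             # stage 3: combine

tokens = [1.0, -2.0, 3.0]
routes = [[(0, 0.5), (1, 0.5)], [(1, 1.0)], [(0, 0.25), (2, 0.75)]]
w_up = [2.0, 1.0, -1.0]
w_down = [1.0, 3.0, 2.0]
print(moe_reference(tokens, routes, w_up, w_down))  # [2.5, 0.0, 1.5]
```

In this unfused form each stage finishes before the next begins, so every intermediate buffer is fully written and then read back. According to the summary, the fused kernel merges the stages so that data movement overlaps with computation.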

## Full Text

The full text of the official-account article must be read inside WeChat; this site provides only the summary.