CUDA MOAT ALERT 🔥: In less than 70 days, software improvements alone cut GB200 NVL72 serving costs by 2.5x for the Kimi architecture, the same model architecture as xAI's popular Cursor Composer 2.5. One of the key optimizations was rewriting the NVFP4 MoE kernel in CuTe-DSL, a gain that stacks on top of the existing wide-expert parallelism optimization. Wide-expert parallelism takes advantage of NVL72's copper backplane, which has 18x higher bandwidth than standard RoCEv2/InfiniBand.
Great work by Xin Li, Jun Yang, & the NVIDIA team on decreasing serving costs by 2.5x in less than 70 days! 🔥
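For a rough sense of scale, here is a minimal sketch of the arithmetic behind a 2.5x cost reduction over 70 days. The function names and the $1.00 baseline are hypothetical, chosen only for illustration; the post gives no absolute prices, and the daily rate assumes a smooth compounding decline, which the post does not claim.

```python
def cost_after_reduction(baseline_cost: float, factor: float) -> float:
    """Cost after an 'N-times cheaper' improvement: baseline divided by N."""
    return baseline_cost / factor


def percent_saved(factor: float) -> float:
    """Percentage of the original cost removed by an N-times reduction."""
    return (1.0 - 1.0 / factor) * 100.0


def equivalent_daily_decline(factor: float, days: int) -> float:
    """Constant daily fractional decline that compounds to the full
    reduction over `days` days (an assumed smooth curve, for intuition)."""
    return 1.0 - (1.0 / factor) ** (1.0 / days)


if __name__ == "__main__":
    baseline = 1.00  # hypothetical cost per million tokens, not from the post
    factor = 2.5     # reduction reported in the post
    days = 70        # upper bound on the time span reported in the post

    print(f"new cost: {cost_after_reduction(baseline, factor):.2f}")
    print(f"saved: {percent_saved(factor):.1f}%")
    print(f"equivalent daily decline: {equivalent_daily_decline(factor, days):.2%}")
```

In other words, a 2.5x reduction leaves 40% of the original cost, which is roughly a 60% saving.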