vLLM 团队与 NVIDIA 合作,为 MiniMax M3 模型提供开箱即用的 day 0 体验,并集成 Inferact 的 EAGLE3 推测解码。当前工作包括:NVIDIA、Inferact 与 SemiAnalysis 推动拆分推理(PR 45879),Inferact 团队启用 FlashInfer M3 MoE 内核(PR 45723),落地后性能将显著提升。NVIDIA 表示 M3 已加入 DeepSeek V4 和 Kimi-K2.6 等前沿开放智能体模型行列。NVIDIA Blackwell Ultra 在 M3 上比 Hopper 实现最高 5 倍 AI 工厂吞吐量,并超过 300 TPS/user。未来通过优化内核、NVFP4 及 NVIDIA Dynamo 拆分推理等,性能有望进一步提升。
Great work to @vllm_project team and @NVIDIA on smooth, out-of-the-box day 0 @MiniMax_AI M3 experience with @inferact EAGLE3 spec decode. Here are the details of ongoing M3 workstream:
NVIDIA, Inferact and SemiAnalysis are working hard on enabling disaggregated inferencing (PR 45879), and the Inferact team is working on enabling FlashInfer M3 MoE kernels (PR 45723). Performance should be much better once those PRs land. Huge shoutout to @rogerw0108 & @mgoin_ and the maintainers for the rapid review and mentorship here!