Accelerating SGLang with Multiple Token Prediction (MTP): An Inference Optimization with Up to 60% Higher Throughput

2025-07-17

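AI Summary

The SGLang inference framework now supports Multiple Token Prediction (MTP), integrated seamlessly with features such as Large-Scale Expert Parallelism (EP) and Prefill-Decode (PD) disaggregation. The technique uses a lightweight draft model to predict several future tokens, which the full target model then verifies in parallel, raising output throughput for models such as DeepSeek V3 by up to 60% without changing generation quality. In a small-scale deployment on 16 H200 GPUs, the approach markedly improves long-sequence inference efficiency and offers a plug-and-play performance gain for production environments.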

What is Multiple Token Prediction (MTP)?

Why MTP is Fast

Performance Evaluation

Deployment Scenarios and Design Motivation

Case Study 1: Small-Scale Deployment

Case Study 2: Large-Scale Deployment

MTP Best Practices

Future Work

Acknowledgment

Accelerating SGLang with Multiple Token Prediction

TL;DR

SGLang now supports combining three advanced features: Multiple Token Prediction (MTP), Large-Scale Expert Parallelism (EP), and Prefill-Decode disaggregation. This integration delivers up to 60% higher output throughput through a new decoding paradigm, better parallelism, and more efficient resource utilization, without sacrificing generation quality. If you are serving models such as DeepSeek V3, MTP is now available in SGLang as a plug-and-play feature that unlocks immediate performance gains. You can find instructions for reproduction here.

SGLang’s inference framework running on NVIDIA GPUs enables AI practitioners to easily deliver inference at scale, empowering end users to “think smart” and harness the reasoning capabilities of state-of-the-art language models at the highest performance.

Introduction

While large language models continue to grow in capability, their token-by-token decoding process remains fundamentally sequential, creating a critical bottleneck for inference throughput. This limitation becomes especially apparent in high-demand applications, where maximizing GPU utilization is crucial for achieving high performance and cost-efficient deployment.

To address this, SGLang brings Multiple Token Prediction (MTP) to the open-source inference ecosystem. MTP is an advanced speculative decoding technique: a lightweight draft model predicts multiple draft tokens, and the full model verifies them in parallel in a single pass. In our benchmarks, MTP unlocks up to 60% higher output throughput for DeepSeek V3 without any loss in generation quality. With MTP now fully integrated, SGLang continues to push the frontier of open-source LLM serving, making advanced decoding capabilities previously confined to proprietary systems accessible and production-ready.
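The draft-then-verify loop can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not SGLang's implementation: the "models" are plain functions that map a token prefix to the next token, verification is greedy (exact match), and every name here (speculative_step, generate, toy_draft, toy_target, num_draft) is hypothetical. In a real system the verification step is a single batched forward pass of the full model; the sketch emulates it with a loop.

```python
from typing import Callable, List, Tuple

Token = int
# Assumption: a "model" is just a function from a token prefix to its greedy next token.
NextTokenFn = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft_next: NextTokenFn,
                     target_next: NextTokenFn, num_draft: int) -> List[Token]:
    """Run one draft-then-verify step and return the newly accepted tokens."""
    # 1) Draft: the cheap model proposes num_draft tokens autoregressively.
    draft: List[Token] = []
    for _ in range(num_draft):
        draft.append(draft_next(prefix + draft))

    # 2) Verify: the target model predicts the next token at every draft position.
    #    A real engine does this in one batched pass; this loop only emulates it.
    target_preds = [target_next(prefix + draft[:i]) for i in range(num_draft + 1)]

    # 3) Accept the longest draft prefix that matches the target's predictions,
    #    then append the target's own token (a correction or a bonus token).
    accepted: List[Token] = []
    for i in range(num_draft):
        if draft[i] != target_preds[i]:
            break
        accepted.append(draft[i])
    accepted.append(target_preds[len(accepted)])
    return accepted


def generate(prompt: List[Token], draft_next: NextTokenFn, target_next: NextTokenFn,
             max_new_tokens: int, num_draft: int = 3) -> Tuple[List[Token], int]:
    """Generate tokens speculatively; also count target verification passes."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft_next, target_next, num_draft))
        target_passes += 1
    return out[:len(prompt) + max_new_tokens], target_passes


def toy_target(prefix: List[Token]) -> Token:
    # Toy target model: counts upward modulo 10.
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[Token]) -> Token:
    # Toy draft model: agrees with the target except right after token 4.
    return 0 if prefix[-1] == 4 else (prefix[-1] + 1) % 10


if __name__ == "__main__":
    tokens, passes = generate([0], toy_draft, toy_target, max_new_tokens=8)
    print(tokens, passes)
```

With greedy verification, the output matches what the target model would produce on its own, which is the sense in which this scheme preserves generation quality. The gain comes from needing fewer target passes than generated tokens whenever the draft model guesses correctly.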

LMSYS: Blog (Chatbot Arena team)
