SGLang Optimizes Pipeline Parallelism to Support Million-Token Ultra-Long Contexts

2026-01-15 00:00

AI Summary

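SGLang has released an optimized Pipeline Parallelism (PP) implementation for ultra-long-context inference, integrating Chunked PP, asynchronous P2P communication, and a dynamic chunking mechanism. In a PP4 TP8 configuration on an H20 cluster, the Prefill Throughput of DeepSeek-V3.1 is 3.31× that of TP8 and 30.5% ahead of TP32, TTFT drops by 67.9%, and strong scaling efficiency reaches 82.8%. The approach is compatible with PD Disaggregation and HiCache, offering an efficient open-source path to million-token-context inference for trillion-parameter models.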


Background: Why Pipeline Parallelism?

Communication Volume and Scalability Analysis

The Bubble Ratio Trade-off

Implementation Complexity and Architectural Generality

The Challenge: The "Bubble" and The "Wall"

The SGLang Pipeline Parallelism Architecture

1\. Chunked Pipeline Parallelism (CPP)

2\. Better Overlapping: Micro-batching and Async P2P Communication

3\. Advanced Option: Dynamic chunking

4\. Production Ready: Compatibility with PD Disaggregation and HiCache

Performance Impact

Input Token Throughput and Strong Scaling Efficiency

Reduced TTFT and Scaling Out for 1 million ITL

Getting Started

Future Roadmap:

Conclusion

Acknowledgement

Reference

Pipeline Parallelism in SGLang: Scaling to Million-Token Contexts and Beyond

TL;DR

We are excited to introduce SGLang's highly optimized Pipeline Parallelism (PP) implementation, specifically engineered to tackle the challenges of ultra-long context inference. By integrating Chunked Pipeline Parallelism, Asynchronous P2P Communication, and a simple yet effective Dynamic Chunking mechanism, this PP design achieves industry-leading performance while remaining compatible with other parallel strategies, PD Disaggregation, and HiCache. In multi-node deployments, scaling to PP4 TP8 with this implementation yields 3.31× the Prefill Throughput of TP8 for DeepSeek-V3.1 on an H20 cluster when the chunked prefill size is set to 12K, outperforming the TP32 solution (2.54×) by a 30.5% margin. This highlights PP's inherent architectural advantage over pure TP for large-scale, cross-node scaling. Our implementation also delivers up to a 67.9% reduction in TTFT while maintaining an 82.8% strong scaling efficiency, providing a highly efficient, open-source path for scaling trillion-parameter models to ultra-long contexts.
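The strong scaling figure can be sanity-checked from the throughput numbers quoted in the TL;DR. The sketch below assumes the usual definition (speedup divided by the factor by which the GPU count grew) and assumes the TP8 baseline runs on 8 GPUs while both PP4 TP8 and TP32 run on 32; the function name and constants are illustrative and are not part of SGLang. The TP32 efficiency it prints is derived under those same assumptions and is not a number reported in the post.

```python
def strong_scaling_efficiency(speedup: float, base_gpus: int, scaled_gpus: int) -> float:
    """Return speedup divided by the factor by which the GPU count grew.

    Assumed definition; a value of 1.0 would mean perfectly linear scaling.
    """
    if base_gpus <= 0 or scaled_gpus <= 0:
        raise ValueError("GPU counts must be positive")
    return speedup / (scaled_gpus / base_gpus)


# Throughput ratios quoted in the TL;DR, both relative to the TP8 baseline.
BASE_GPUS = 8            # assumption: TP8 uses 8 GPUs
SCALED_GPUS = 32         # assumption: PP4 x TP8 and TP32 both use 32 GPUs
PP4_TP8_SPEEDUP = 3.31
TP32_SPEEDUP = 2.54

pp_eff = strong_scaling_efficiency(PP4_TP8_SPEEDUP, BASE_GPUS, SCALED_GPUS)
tp_eff = strong_scaling_efficiency(TP32_SPEEDUP, BASE_GPUS, SCALED_GPUS)

print(f"PP4 TP8 strong scaling efficiency: {pp_eff:.2%}")
print(f"TP32 strong scaling efficiency:    {tp_eff:.2%}")
```

Under these assumptions the PP4 TP8 result is 3.31 / 4, which is consistent with the 82.8% efficiency stated above after rounding.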

LMSYS: Blog (Chatbot Arena team)
