Pipeline Parallelism in SGLang: Scaling to Million-Token Contexts and Beyond
TL;DR
We are excited to introduce SGLang's highly optimized Pipeline Parallelism (PP) implementation, engineered for ultra-long context inference. By integrating Chunked Pipeline Parallelism, Asynchronous P2P Communication, and a simple yet effective Dynamic Chunking mechanism, this PP design achieves industry-leading performance while remaining compatible with other parallel strategies, PD Disaggregation, and HiCache. In multi-node deployments of DeepSeek-V3.1 on an H20 cluster, scaling to PP4 TP8 with a chunked prefill size of 12K yields 3.31× the prefill throughput of TP8, outperforming the TP32 solution (2.54×) by 30.5%. This highlights PP's inherent architectural advantage over pure TP for large-scale, cross-node scaling. Our implementation also delivers up to a 67.9% reduction in TTFT while maintaining 82.8% strong scaling efficiency, providing a highly efficient, open-source path for scaling trillion-parameter models to ultra-long contexts.
Figure: Prefill Throughput (Batch Size = 1) of DeepSeek-V3.1 on H20 (higher is better).
Note: DCK 12288 (σ=0.65) means Dynamic Chunking is enabled with the initial chunked prefill size set to 12K and the smooth factor set to 0.65.
Introduction
As Large Language Models (LLMs) scale toward trillion-parameter architectures and "infinite" context windows, the underlying serving infrastructure must evolve toward more granular, cross-node parallelization strategies. While KV cache techniques effectively mitigate redundant computation, they cannot circumvent the prohibitive Time to First Token (TTFT) inherent in ultra-long sequences with an extremely large initial Input Token Length (ITL). Although Tensor Parallelism (TP) remains the conventional approach for intra-node scaling, it frequently encounters communication bottlenecks in multi-node deployments. Traditional Pipeline Parallelism (PP) addresses this bottleneck by reducing the communication volume, but it struggles with resource underutilization and bubble overhead when processing such massive prompts.
Drawing inspiration from both open-source innovations and academic research, SGLang introduces a highly optimized Pipeline Parallelism implementation featuring Asynchronous Communication and Dynamic Chunked Prefill, which effectively minimizes the pipeline bubbles. By integrating these techniques, SGLang explores and reframes the processing of ultra-long prompts—effectively scaling away the prohibitive latency of long-sequence prefilling and transforming it into a high-throughput, computationally scalable streaming workflow.
Empirical benchmarks demonstrate that SGLang’s PP implementation achieves industry-leading performance. In large-scale deployments, it maintains over 80% scaling efficiency for various model architectures while scaling out to PP4, and it also delivers up to an 81% reduction in TTFT for ultra-long prompts when deploying Qwen3-235B-A22B-FP8 on H20 with PP8.
Background: Why Pipeline Parallelism?
To validate the necessity of Pipeline Parallelism (PP) for long-context prefill, it is essential to evaluate it against existing paradigms—specifically Tensor Parallelism (TP) and Context Parallelism (CP). While TP and CP offer distinct advantages, a theoretical and empirical decomposition of their communication volumes, bubble ratios, and implementation complexities reveals that PP occupies a unique, optimal position for multi-node scaling. The following analysis outlines the specific trade-offs inherent to each method.
Communication Volume and Scalability Analysis
The primary bottleneck in distributed inference scaling is inter-device communication. As model depth and sequence length increase, the volume of data transmitted between devices becomes a limiting factor, especially when scaling to large, multi-node deployments.
Let $B$ denote the batch size (often 1 for ultra-long context inference), $S$ the total sequence length, $H$ the hidden state dimension, $L$ the total number of layers, and $M$ the number of micro-batches, and assume the activation precision is FP8 (1 byte). Based on these definitions, we analyze the communication volume of the different parallel strategies.
TP: TP splits individual weight tensors across multiple devices within a single layer. As a result, it incurs high communication overhead, because activations must be synchronized after both the Attention Block and the MLP Block. Consequently, the communication volume scales linearly with the number of layers. This frequent All-Reduce synchronization makes TP bandwidth-bound, limiting its scalability across large clusters.
(Note: Each All-Reduce moves $2\times$ the data size in a ring-based implementation. Each layer involves $2$ All-Reduce operations, one after the Attention Block and one after the MLP Block.)
CP: Similarly, CP requires extensive synchronization to aggregate Key-Value (KV) states across devices. Typically, CP uses All-Gather at every layer, resulting in significant latency penalties in bandwidth-constrained environments.
(Note: This assumes CP uses a Ring-Attention-based solution. For models using GQA, $H_{KV}$ is smaller than $H$, which reduces CP's communication volume.)
PP: In contrast, PP exhibits a significantly reduced communication footprint. Data is transferred only at the boundaries of pipeline stages, using Point-to-Point (P2P) primitives rather than collective operations. Since a stage typically contains multiple layers, the communication frequency is determined by the number of stages ($P$), not the total number of layers ($L$). Crucially, for a fixed model, as we increase the number of layers per stage, the communication volume at each boundary remains constant.
(Note: In multi-node deployments where $P \ll L$, PP achieves a nearly order-of-magnitude reduction in total communication volume compared to TP.)
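To make the scaling argument above concrete, here is a minimal Python sketch. It assumes a simplified accounting that follows the notes: TP performs two ring All-Reduces per layer, each moving about twice the activation size, while PP sends the activations once across each of the $P - 1$ stage boundaries. The function names and the example shape are hypothetical, CP is left out, and the exact ratio depends on whether traffic is counted per device or in aggregate, so treat the result as an illustration of how the two volumes scale with $L$ and $P$ rather than as a measured figure.

```python
def tp_comm_bytes(batch, seq_len, hidden, num_layers, bytes_per_elem=1):
    """Approximate TP traffic: 2 ring All-Reduces per layer, each ~2x the activation size."""
    activation = batch * seq_len * hidden * bytes_per_elem
    return 2 * 2 * activation * num_layers


def pp_comm_bytes(batch, seq_len, hidden, pp_size, bytes_per_elem=1):
    """Approximate PP traffic: one P2P activation transfer per stage boundary (P - 1 of them)."""
    activation = batch * seq_len * hidden * bytes_per_elem
    return (pp_size - 1) * activation


if __name__ == "__main__":
    # Hypothetical shape, not taken from any specific model: FP8 activations (1 byte).
    B, S, H, L, P = 1, 1_000_000, 8192, 64, 4
    tp = tp_comm_bytes(B, S, H, L)
    pp = pp_comm_bytes(B, S, H, P)
    print(f"TP: {tp / 1e9:.1f} GB  PP: {pp / 1e9:.1f} GB  TP/PP: {tp / pp:.1f}x")
```

Under these assumptions the TP volume grows with $L$ while the PP volume grows only with $P - 1$, which is the property the comparison relies on.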
The Bubble Ratio Trade-off
While PP optimizes communication, it introduces pipeline bubbles—idle periods where devices wait for data dependencies. This presents a trade-off between communication efficiency and device utilization.
TP and CP: Both methods achieve a zero bubble ratio theoretically, as all devices compute simultaneously on different parts of the same tensor or sequence. This maximizes compute intensity, assuming communication does not stall computation.
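PP: PP inevitably incurs a bubble ratio, quantified by the interaction between the PP size ($P$) and the number of micro-batches ($M$). For the standard pipeline schedule with equally sized micro-batches:
$$\text{Bubble Ratio} = \frac{P - 1}{M + P - 1}$$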
However, for long-context prefill scenarios where the workload is substantial ($M \gg P$), this ratio decreases significantly, rendering the efficiency loss negligible compared to the communication gains. In the Performance Impact section, we will evaluate the Strong Scaling Efficiency (i.e., the number of processors is increased while the problem size remains constant) of our PP implementation.
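As a small worked example of strong scaling efficiency, the sketch below assumes the usual definition: speedup over the baseline divided by the factor by which the GPU count grows. It plugs in the throughput ratios quoted in the TL;DR and assumes that both PP4 TP8 and TP32 use four times the GPUs of the TP8 baseline. The function name is hypothetical.

```python
def strong_scaling_efficiency(speedup, resource_factor):
    """Speedup relative to the baseline, divided by the growth in resources."""
    if resource_factor <= 0:
        raise ValueError("resource_factor must be positive")
    return speedup / resource_factor


# Speedups over TP8 quoted in the TL;DR; both setups are assumed to use 4x the GPUs.
pp4_tp8_efficiency = strong_scaling_efficiency(3.31, 4)
tp32_efficiency = strong_scaling_efficiency(2.54, 4)

if __name__ == "__main__":
    print(f"PP4 TP8: {pp4_tp8_efficiency:.4f}")
    print(f"TP32:    {tp32_efficiency:.4f}")
```

With the rounded 3.31× figure this gives roughly 0.83 for PP4 TP8, in line with the reported 82.8% strong scaling efficiency, against roughly 0.64 for TP32.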
It is worth noting that while PP offers a distinct advantage in cross-node scaling, where communication bandwidth often becomes the primary bottleneck, a pure high-degree PP configuration is generally not recommended. This is because, for a fixed workload $M$, the pipeline bubble ratio grows with the PP size $P$ (roughly proportionally when $M \gg P$). Instead, a better strategy is to use bubble-free parallel methods, such as TP or CP, for intra-node scaling. Since intra-node communication typically runs over high-bandwidth interconnects like NVLink, these collectives are far less likely to become a performance bottleneck than cross-node transfers, allowing the system to maximize compute utilization without incurring additional pipeline overhead.
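The trade-off between $P$ and $M$ can be sketched numerically. The snippet below assumes the standard bubble ratio $(P - 1)/(M + P - 1)$ for equally sized micro-batches, and assumes that each prefill chunk of a single long prompt acts as one micro-batch, so that $M$ is the prompt length divided by the chunk size, rounded up. The helper names are hypothetical, and the 1M-token prompt with a 12288-token chunk is used purely as an illustration.

```python
def num_chunks(seq_len, chunk_size):
    """Number of prefill chunks for one prompt (ceiling division)."""
    if seq_len < 1 or chunk_size < 1:
        raise ValueError("seq_len and chunk_size must be positive")
    return -(-seq_len // chunk_size)


def bubble_ratio(pp_size, num_microbatches):
    """Idealized pipeline bubble ratio: (P - 1) / (M + P - 1)."""
    if pp_size < 1 or num_microbatches < 1:
        raise ValueError("pp_size and num_microbatches must be positive")
    return (pp_size - 1) / (num_microbatches + pp_size - 1)


if __name__ == "__main__":
    m = num_chunks(1_000_000, 12288)  # assumed: one micro-batch per chunk
    for p in (2, 4, 8, 16):
        print(f"P={p:2d}  M={m}  bubble ratio={bubble_ratio(p, m):.3f}")
```

For a fixed $M$ the ratio rises as $P$ grows, which is why the text recommends keeping the PP degree moderate and filling each node with TP or CP instead.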
Implementation Complexity and Architectural Generality
The implementation complexity and architectural generality of a new feature are critical factors for a modern inference system, especially for an open-source project.
TP: TP is easy to implement and widely supported. However, large TP degrees are not always applicable, because the granularity required by the quantization block sometimes cannot be aligned with the partitioning constraints imposed by MoE FFN weights. Consequently, even disregarding communication volume and overhead, larger TP is often precluded in multi-node scaling scenarios by this incompatibility with quantization, which is an indispensable optimization technique.
CP: CP is complex: it requires specific, often intrusive modifications to the attention mechanism (e.g., Ring Attention). These changes must be tailored to every attention variant and specific model, which reduces generality.
PP: PP is of medium complexity. It requires partitioning the model but remains agnostic to the internal mechanics of the layers. This makes PP a general-purpose solution applicable to all model architectures without requiring kernel-level rewrites for specific attention variants. To some extent, eliminating PP bubbles is more difficult than implementing PP itself.
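The quantization constraint mentioned for TP can be illustrated with a simple divisibility check. This is a sketch under assumptions: it supposes that a block-quantized FFN weight can only be sharded if each TP rank receives a whole number of quantization blocks along the split dimension. The function name and the sizes (an intermediate dimension of 2048 and a block size of 128) are hypothetical, and the check ignores any workaround a real system might apply.

```python
def tp_shard_is_block_aligned(intermediate_size, tp_size, quant_block):
    """True if each TP shard of the split dimension is a whole number of quantization blocks."""
    if intermediate_size % tp_size != 0:
        return False
    shard = intermediate_size // tp_size
    return shard % quant_block == 0


if __name__ == "__main__":
    # Hypothetical sizes chosen only to show where alignment breaks.
    for tp in (8, 16, 32):
        ok = tp_shard_is_block_aligned(2048, tp, 128)
        print(f"TP{tp}: shard={2048 // tp}  aligned={ok}")
```

In this example the shard shrinks below one quantization block at TP32, while PP leaves each weight tensor intact and so does not run into this constraint.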
In conclusion, its balance of generality and scaling efficiency makes PP not merely an alternative, but a necessary component for scaling long-context prefill to massive, multi-node clusters where TP and CP encounter bandwidth ceilings. Meanwhile, CP has the potential to complement TP for intra-node, bubble-free scaling and acceleration. PP × CP is already under development (see Future Roadmap) and will be covered in Part II of this blog.
The Challenge: The "Bubble" and The "Wall"
In a traditional Pipeline Parallelism setup, the model layers are partitioned across GPUs (Stage 1 to Stage N). When serving standard requests (e.g., = v0.5.7
Future Roadmap
We are continuously refining the PP stack. Our 2026 H1 PP Roadmap includes these important tasks:
Compatibility with Context Parallelism to further reduce TTFT
Pipeline Parallelism for the Decode side
Performance Optimization and best practice tuning
Better fitting and chunking strategy for dynamic chunking
Conclusion
SGLang’s implementation of Pipeline Parallelism is more than just model splitting; it is a complete re-engineering of the inference lifecycle for the Long-Context Era. By combining chunked prefill with asynchronous communication and dynamic chunking, SGLang provides a highly efficient, open-source path to serving and accelerating trillion-parameter models for long context.
Acknowledgement
We would like to thank the SGLang team and community for the implementation and generous support, especially Shangming Cai, Xuchun Shang, Yanbo Yang, Leon Gao, Ying Sheng, Zhiqiang Xie, Lianmin Zheng, and many others.
We would like to thank Jianhao Fu (from AntGroup SCT Network Team), Kevin Li (from TikTok), Siyu Liu (from Alibaba Cloud Computing), Xiaolei Zhang (from ByteDance), Teng Ma (from Alibaba Cloud Computing), Chao Wang (from Meituan), and Xiaowei Wang (from NVIDIA) for their significant contributions to code improvement and testing.
We learned a lot from the system designs of SGLang, Mooncake[1], and TeraPipe[3], which jointly helped improve this Pipeline Parallelism implementation.
Reference
[1] Qin, Ruoyu, et al. "Mooncake: A KVCache-centric disaggregated architecture for LLM serving." ACM Transactions on Storage (2024).
[2] Yang, An, et al. "Qwen2.5-1M technical report." arXiv preprint arXiv:2501.15383 (2025).
[3] Li, Zhuohan, et al. "TeraPipe: Token-level pipeline parallelism for training large-scale language models." International Conference on Machine Learning. PMLR, 2021.