# Improving DeepEP MoE Load Balance in SGLang with Waterfill and LPLB
NVIDIA Team, June 26, 2026

## TL;DR
Mixture-of-Experts (MoE) models rely on Expert Parallelism (EP) to scale inference across multiple GPUs. In SGLang, DeepEP and EPLB provide high-performance serving under EP, but the workload seen by each rank can still be imbalanced because tokens are not routed uniformly across experts.
This blog introduces two dispatch-time load balancing features in SGLang:

- Waterfill, a lightweight shared-expert load balancing method for DeepEP. It dispatches the shared expert through DeepEP and assigns it to less-loaded ranks. On two Hopper GPU nodes with DeepSeek-V3/R1-style serving workloads, Waterfill improves total throughput by +1.48% to +4.66% across MMLU, GPQA, and GSM8K. On DeepSeek V4, the best measured point improved from 49,253 tok/s to 51,677 tok/s (+4.92%).
- LPLB, a linear-programming-based load balancer for redundant expert replicas. It solves a per-layer dispatch optimization problem over redundant experts. With redundant EPLB placement on the same two Hopper GPU nodes, LPLB improves total throughput by +0.84% to +7.34% across MMLU, GPQA, and GSM8K.
The Waterfill work is built on two SGLang PRs: shared expert fusion under EP (#20089) and Waterfill dispatch balancing (#19290). DeepSeek V4 support is added in #25391. LPLB is introduced in #24515.

## Introduction
Large MoE models such as DeepSeek-V3/R1 and DeepSeek V4 use sparse expert activation to increase model capacity while keeping per-token computation manageable. During inference, EP distributes experts across GPUs and routes tokens to the ranks that own the selected experts. This reduces per-GPU memory pressure and makes large-scale serving practical, but it also introduces a central systems problem: the router does not generate perfectly balanced expert traffic.
When some experts receive many more tokens than others, the EP group waits for the busiest ranks. This imbalance affects both computation and communication. Static placement methods such as EPLB can improve the long-term placement of experts and redundant replicas, but a single batch can still have residual imbalance. Dispatch-time load balancing addresses this remaining gap by deciding, at runtime, which physical replica should process each token or each shared-expert request.
In SGLang, we have been working on two dispatch-time approaches for DeepEP MoE inference:

- Waterfill: a low-overhead algorithm focused on the shared expert path.
- LPLB: an LP-based algorithm focused on token routing across redundant expert replicas.
The two algorithms target the same broad layer of the system: dispatch-time MoE load balancing. They make different tradeoffs and operate on different dispatch choices.

## Background: Load Imbalance in DeepEP MoE Inference
DeepEP accelerates MoE inference by providing optimized token dispatch and combine kernels for expert parallelism. In a typical DeepSeek-style MoE layer, each token is routed to several routed experts selected by the model router. Some models also include a shared expert, which is applied to every token.
From a serving-system perspective, routed experts and shared experts create different load patterns:

- Routed experts are sparse. Different tokens choose different experts, so their load depends on the router distribution.
- Shared experts are dense. Every token needs the shared expert, so the shared-expert workload is present for the full batch.
- Redundant experts, introduced by EPLB-style placement, provide multiple physical replicas for some logical experts. They create an opportunity for dispatch-time balancing, because the system can choose which physical replica processes a token without changing the model's logical expert choice.
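To make the routed-expert load pattern concrete, here is a minimal sketch that counts how many routed token-expert pairs land on each EP rank and reports a simple imbalance ratio. The contiguous expert-to-rank mapping (`expert_id // experts_per_rank`) and the helper names are assumptions for illustration only; they are not taken from SGLang or DeepEP.

```python
from collections import Counter


def rank_loads(topk_ids, experts_per_rank, num_ranks):
    """Count routed token-expert pairs per rank.

    Assumes (hypothetically) that experts are placed contiguously,
    so expert e lives on rank e // experts_per_rank.
    """
    counts = Counter(e // experts_per_rank for row in topk_ids for e in row)
    return [counts.get(r, 0) for r in range(num_ranks)]


def imbalance(loads):
    """Ratio of the busiest rank's load to the mean load (1.0 = balanced)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 1.0


# Four tokens, top-2 routing, 8 experts spread over 4 ranks.
topk_ids = [[0, 1], [0, 5], [1, 6], [0, 2]]
loads = rank_loads(topk_ids, experts_per_rank=2, num_ranks=4)
ratio = imbalance(loads)
```

In this toy batch most tokens pick experts on rank 0, so that rank carries several times the load of its peers even though every token selected the same number of experts.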
Static expert placement is helpful, but it cannot remove all runtime imbalance. The actual tokens in a batch may still concentrate on a subset of experts or ranks. In DeepEP, this can leave some ranks waiting for overloaded peers. Waterfill and LPLB both aim to reduce this dispatch-time imbalance while preserving the model's semantics.

## Waterfill: Lightweight Load Balancing for Shared Expert Dispatch

### Waterfill Dispatch Strategy
Waterfill is a lightweight load balancing algorithm for the shared expert path under DeepEP.
If the shared expert is always computed locally on every rank, then each rank pays the shared-expert cost regardless of whether it is already overloaded by routed experts. The overloaded ranks remain overloaded, and the less-loaded ranks cannot help absorb the shared-expert work.
Waterfill changes this by treating the shared expert as a dispatchable expert slot. After the routed experts are selected, Waterfill estimates the current routed load on each EP rank, then assigns the shared expert work to ranks with lower load. Conceptually, it fills the valleys in the rank-load distribution, similar to pouring water into uneven containers.
For each token, Waterfill adds one extra expert slot for the shared expert. Instead of always assigning that slot to the token's local rank, it selects a rank based on the current load distribution. This keeps the routed expert choices unchanged, so the model still computes the same logical routed experts and the same shared expert. The only thing that changes is which physical rank executes the shared expert work.
At a high level, the algorithm is:

1. Count the routed expert load already landing on each EP rank.
2. Use that count as a per-rank load score. In dynamic mode, SGLang first runs one EP-group collective, so the score can use the global routed-load vector plus each rank's current local batch size.
3. Add one shared-expert slot per participating token, and compute a target waterline:

   $$ H = \left\lceil \frac{\sum_{r} L_{r} + N}{R} \right\rceil $$

   Here $L_{r}$ is rank $r$'s load score, $N$ is the number of shared-expert slots to place, and $R$ is the EP group size.
4. Ranks below this waterline have slack:

   $$ S_{r} = \max\left(H - L_{r},\, 0\right) $$

5. For each token, Waterfill samples the shared-expert target rank from candidate ranks with probability proportional to slack, with a small local-rank preference. If all candidates have zero slack, it falls back to the clearly lighter candidate rank, again keeping the local-rank preference.
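The steps above can be sketched in a few lines of plain Python. This is only an illustration of the waterline and slack formulas, not the SGLang implementation: the function names, the `local_bonus` weight used for the local-rank preference, and the tie-breaking rule in the zero-slack fallback are all assumptions.

```python
import random


def waterline(loads, num_shared_slots):
    """H = ceil((sum_r L_r + N) / R), using integer arithmetic."""
    total = sum(loads) + num_shared_slots
    return -(-total // len(loads))


def slack(loads, h):
    """S_r = max(H - L_r, 0) for every rank."""
    return [max(h - load, 0) for load in loads]


def pick_shared_rank(candidates, local_rank, loads, slacks, rng,
                     local_bonus=1.25):
    """Sample a target rank with probability proportional to slack.

    local_bonus is a hypothetical multiplier that models the
    "small local-rank preference"; the real value is not given here.
    """
    weights = [slacks[r] * (local_bonus if r == local_rank else 1.0)
               for r in candidates]
    if sum(weights) > 0:
        return rng.choices(candidates, weights=weights, k=1)[0]
    # Zero slack on every candidate: assume the fallback picks the
    # least-loaded candidate and breaks ties toward the local rank.
    return min(candidates, key=lambda r: (loads[r], r != local_rank))


loads = [120, 80, 100, 60]            # routed load score per EP rank
h = waterline(loads, num_shared_slots=40)
slacks = slack(loads, h)
rng = random.Random(0)
target = pick_shared_rank([0, 1], local_rank=0, loads=loads,
                          slacks=slacks, rng=rng)
```

With these numbers the waterline is 100, so only ranks 1 and 3 have slack. A token whose candidates are ranks 0 and 1 always sends its shared-expert slot to rank 1, because rank 0 already sits above the waterline.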
The detailed derivation and the exact SGLang static/dynamic behavior are documented in the Waterfill dispatch balancing PR.
There is an important communication tradeoff. If every token could send its shared-expert work to any EP rank, Waterfill would have more balancing freedom, but it could also increase all-to-all traffic. For GPU MoE serving, communication is often more expensive than the extra shared-expert computation. The communication-conservative candidate set therefore keeps the shared expert on ranks that the token already visits for routed experts, with the source rank kept as a fallback. SGLang also supports an all-rank mode, which gives Waterfill more balancing freedom but can add a new per-token dispatch destination. This is a deliberate communication tradeoff rather than a change in model semantics.
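A rough sketch of the two candidate-set choices described above, under stated assumptions: experts are mapped to ranks contiguously (`expert_id // experts_per_rank`), the source rank is always included in the conservative set, and `candidate_ranks` with its `all_rank_mode` flag is a hypothetical helper rather than an SGLang API.

```python
def candidate_ranks(topk_expert_ids, experts_per_rank, source_rank,
                    num_ranks, all_rank_mode=False):
    """Return the ranks that may receive a token's shared-expert slot."""
    if all_rank_mode:
        # More balancing freedom, but may add a new dispatch destination.
        return list(range(num_ranks))
    # Conservative mode: reuse ranks the token is already sent to for
    # its routed experts, plus the source rank as a fallback.
    visited = {e // experts_per_rank for e in topk_expert_ids}
    visited.add(source_rank)
    return sorted(visited)


conservative = candidate_ranks([3, 17, 18, 40], experts_per_rank=8,
                               source_rank=1, num_ranks=8)
unrestricted = candidate_ranks([3, 17, 18, 40], experts_per_rank=8,
                               source_rank=1, num_ranks=8,
                               all_rank_mode=True)
```

In the conservative case the token can only place its shared-expert work on ranks it already communicates with, so no extra all-to-all destination is introduced.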
By shifting shared-expert work away from already-heavy ranks and toward lighter ranks, Waterfill balances per-rank work and improves end-to-end throughput.
Figure 1. Waterfill moves shared-expert work from overloaded ranks to lighter ranks while keeping the routed expert choices unchanged, shortening the slowest MoE-layer path without changing model semantics.

### Shared Expert Fusion as the Enabling Mechanism
Waterfill can be further accelerated by fusing shared experts and routed experts.
Under EP, shared experts previously used a separate execution path from the routed experts. If Waterfill chose non-local shared-expert ranks under that design, it would need to extract shared-expert tokens from the dispatched routed-expert layout and launch a separate shared-expert computation, adding extra layout conversion and launch overhead.
Shared expert fusion avoids that path by representing the shared expert as another expert slot in the same DeepEP MoE layout. In DeepSeek V3/R1, the router still selects the original routed top-k experts, and the TopK output gets one additional column for the shared expert. In the DeepEP physical expert ID layout, each rank reserves one extra shared-expert slot next to its routed experts. This lets routed experts and the shared expert share the same DeepEP dispatch, grouped-GEMM, and combine flow.
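The following sketch shows one way such a fused layout could look. The exact ID arithmetic is an assumption: it places each rank's shared-expert slot immediately after that rank's routed experts, and the helper names (`routed_physical_id`, `shared_slot_id`, `fuse_topk`) are hypothetical, not SGLang or DeepEP functions.

```python
def routed_physical_id(logical_id, routed_per_rank):
    """Map a routed expert to its slot when every rank has one extra slot."""
    rank, local = divmod(logical_id, routed_per_rank)
    return rank * (routed_per_rank + 1) + local


def shared_slot_id(rank, routed_per_rank):
    """Shared-expert slot: assumed to be the last slot on each rank."""
    return rank * (routed_per_rank + 1) + routed_per_rank


def fuse_topk(topk_ids, shared_ranks, routed_per_rank):
    """Append one shared-expert column to each token's routed top-k row."""
    fused = []
    for row, rank in zip(topk_ids, shared_ranks):
        routed = [routed_physical_id(e, routed_per_rank) for e in row]
        fused.append(routed + [shared_slot_id(rank, routed_per_rank)])
    return fused


# Two tokens, top-2 routing, 4 routed experts per rank.
# shared_ranks is the per-token rank chosen for the shared expert.
fused = fuse_topk([[0, 5], [7, 2]], shared_ranks=[2, 0], routed_per_rank=4)
```

Because the shared expert is just one more column in the top-k output, a single dispatch, grouped-GEMM, and combine pass can cover both routed and shared work, and changing the entry in that column is all it takes to move the shared-expert work to another rank.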
This is why the Waterfill feature was split into two pieces:

- #20089 fuses the shared expert into the DeepEP MoE path with a fixed local assignment.
- #19290 adds Waterfill, which replaces the fixed assignment with load-aware shared-expert dispatch.
The fusion itself is not the final load balancing algorithm. It is the required mechanism that makes shared-expert dispatch visible to DeepEP and therefore controllable by Waterfill.

## LPLB: LP-Based Load Balancing for Redundant Expert Replicas

### The Problem LPLB Solves
EPLB places redundant replicas of hot logical experts and then, by default, splits each hot expert's tokens evenly across its physical copies. Even splitting is optimal only when the offline distribution used to build the placement matches the live traffic. In practice it often does not: a single batch concentrates on different experts than the calibration set, the served dataset drifts away from the recording dataset, and the rebalance period is long enough that placement is effectively static for many batches. When that happens, evenly dividing a hot expert's load still leaves the ranks that own its copies unevenly loaded relative to the rest of the EP group, and the whole group waits on the busiest rank.
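A toy example makes the cost of an even split visible. The numbers below are invented for illustration, and the brute-force search over integer splits is only a stand-in for finding the best achievable peak on a tiny case; it is not how a production balancer would solve the problem.

```python
def peak_load(base_loads, replica_ranks, split):
    """Max per-rank load after adding a hot expert's tokens to its replicas."""
    loads = list(base_loads)
    for rank, tokens in zip(replica_ranks, split):
        loads[rank] += tokens
    return max(loads)


def best_two_way_split(base_loads, replica_ranks, hot_tokens):
    """Brute-force the split of hot_tokens across two replicas."""
    best = None
    for x in range(hot_tokens + 1):
        split = (x, hot_tokens - x)
        peak = peak_load(base_loads, replica_ranks, split)
        if best is None or peak < best[0]:
            best = (peak, split)
    return best


# Load from single-copy experts on ranks 0..2 (hypothetical numbers).
base = [80, 20, 60]
# One hot expert with 100 tokens and replicas on ranks 0 and 1.
even_peak = peak_load(base, (0, 1), (50, 50))
best_peak, best_split = best_two_way_split(base, (0, 1), 100)
```

Here the even split pushes rank 0 to 130 tokens, while sending only 20 of the hot expert's tokens to rank 0 and 80 to rank 1 brings the peak down to 100. The gap between those two peaks is the residual imbalance that a dispatch-time balancer can recover.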
LPLB closes this gap at dispatch time. For each MoE layer, on each batch, it looks at the _actual_ per-expert token counts and decides how to split each replicated expert's tokens across its physical copies so that the maximum per-rank load is minimized. It does not move weights and it does not change the router's logical top-k choices; it only chooses, among the valid physical replicas of a logical expert, how much traffic each replica receives. The result is an optimal min–max assignment for the batch in front of it, rather than the static even split EPLB bakes in offline.

### The LP Formulation
LPLB casts this as a small linear program solved per layer. The intuition maps directly onto the constraints:

- Objective: minimize the peak. Introduce a scalar M representing the maximum load over all ranks, and minimize it. Driving M down pulls the busiest rank toward the average, which is exactly what shortens the grouped-GEMM tail that EP imbalance creates.
- Rank-load constraints. For every rank, _(load from its redundant-expert copies) + (load from its single-copy experts) + (slack to the peak) = M_. The single-copy load on…