SGLang Introduces Waterfill and LPLB to Improve DeepEP MoE Load Balancing

2026-06-26 00:00

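Why it was featured

SGLang introduces two load-balancing algorithms, Waterfill and LPLB. In measurements on DeepSeek V3/R1 and V4, throughput improves by up to about 7%, which makes them worth trying for developers who run MoE inference on SGLang.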

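AI Summary

SGLang adds two dispatch-time load-balancing methods for DeepEP MoE inference. Waterfill assigns the shared expert to less-loaded ranks; under DeepSeek-V3/R1 serving workloads it raises total throughput by 1.48% to 4.66%, and on DeepSeek V4 the best measured point rises from 49,253 tok/s to 51,677 tok/s (+4.92%). LPLB uses linear programming to optimize token routing across redundant expert replicas; combined with EPLB on the same cluster, it raises throughput by 0.84% to 7.34%.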

Original text (untranslated)


Improving DeepEP MoE Load Balance in SGLang with Waterfill and LPLB

NVIDIA Team · June 26, 2026

TL;DR

Mixture-of-Experts (MoE) models rely on Expert Parallelism (EP) to scale inference across multiple GPUs. In SGLang, DeepEP and EPLB provide high-performance serving under EP, but the workload seen by each rank can still be imbalanced because tokens are not routed uniformly across experts.

This blog introduces two dispatch-time load balancing features in SGLang:

Waterfill, a lightweight shared-expert load balancing method for DeepEP. It dispatches the shared expert through DeepEP and assigns it to less-loaded ranks. On two Hopper GPU nodes with DeepSeek-V3/R1-style serving workloads, Waterfill improves total throughput by 1.48% to 4.66% across MMLU, GPQA, and GSM8K. On DeepSeek V4, the best measured point improved from 49,253 tok/s to 51,677 tok/s (+4.92%).

LPLB, a linear-programming-based load balancer for redundant expert replicas. It solves a per-layer dispatch optimization problem over redundant experts. With redundant EPLB placement on the same two Hopper GPU nodes, LPLB improves total throughput by 0.84% to 7.34% across MMLU, GPQA, and GSM8K.
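To make the idea of assigning shared-expert work to less-loaded ranks concrete, here is a minimal sketch of a classic water-filling allocation in plain Python. It is an illustration under stated assumptions, not SGLang's implementation: the function name `waterfill`, the per-rank token counts, and the notion of a single pool of shared-expert tokens to distribute are all hypothetical.

```python
def waterfill(routed_load, shared_tokens):
    """Split shared-expert tokens across ranks, filling the lightest ranks first.

    routed_load:   hypothetical per-rank token counts from routed experts.
    shared_tokens: hypothetical number of shared-expert tokens to place.
    Returns a per-rank allocation that sums to shared_tokens.
    """
    n = len(routed_load)
    order = sorted(range(n), key=lambda r: routed_load[r])  # lightest rank first
    alloc = [0] * n
    remaining = shared_tokens
    level = routed_load[order[0]]  # current "water level"
    k = 1                          # number of ranks currently being filled

    while remaining > 0:
        if k < n:
            next_level = routed_load[order[k]]
            need = (next_level - level) * k  # tokens to lift k ranks to next_level
            if need <= remaining:
                for r in order[:k]:
                    alloc[r] += next_level - level
                remaining -= need
                level = next_level
                k += 1
                continue
        # Not enough tokens to reach the next rank (or none left to join):
        # spread what remains as evenly as possible over the k ranks in the pool.
        base, extra = divmod(remaining, k)
        for i, r in enumerate(order[:k]):
            alloc[r] += base + (1 if i < extra else 0)
        remaining = 0
    return alloc


if __name__ == "__main__":
    routed = [900, 400, 700, 600]
    extra = waterfill(routed, 600)
    print(extra)                                   # per-rank shared-expert tokens
    print([a + b for a, b in zip(routed, extra)])  # resulting per-rank totals
```

With these made-up numbers, the busiest rank receives no shared-expert tokens and the other three end up within one token of each other, so the peak per-rank load stays at 900. Splitting the same 600 tokens evenly (150 per rank) would instead push the busiest rank to 1,050.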

LMSYS: Blog (Chatbot Arena team)
