SGLang EPD Disaggregation: Elastic Encoder Scaling for Vision-Language Models

2026-01-12 00:00

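AI Summary

SGLang introduces the EPD (Encoder-Prefill-Decode) disaggregation architecture, which decouples vision encoding from language processing and lets encoders scale horizontally on their own instead of relying on inefficient tensor parallelism. It is compatible with existing PD disaggregation and supports transfer backends such as ZMQ and Mooncake, as well as vision embedding caching. Tests show that in image-heavy scenarios (such as multi-image inputs) at 1 QPS, time to first token (TTFT) is about 6–8× lower than with colocated deployment; in image-light scenarios, however, network overhead can degrade performance.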


Contents

TL;DR

Introduction

The ViT Scaling Problem: Why Tensor Parallelism Doesn't Always Help

The Counter-Intuitive Finding

Architecture Overview

Key Components

Implementation Details

Image Distribution Strategies

Transfer Backends

Vision Embedding Cache

Usage Examples

Benchmark

Experimental Setup

Bench Results

Acknowledgment

Learn more

EPD Disaggregation: Elastic Encoder Scaling for Vision-Language Models in SGLang

TL;DR

We introduce Encoder-Prefill-Decode (EPD) Disaggregation in SGLang, a novel architecture that separates vision encoding from language processing in Vision-Language Models (VLMs). This can enable:

Independent scaling of vision encoding capacity: Encoder servers can be scaled horizontally without affecting language model deployment, enabling better resource utilization for vision-heavy workloads.

Compatibility with existing PD disaggregation: EPD can be combined with Prefill-Decode disaggregation for a complete three-tier architecture.

Flexible transfer backends: Support for multiple transfer mechanisms (ZMQ, GPU-direct via Mooncake) allows optimization for different deployment scenarios.

Vision embedding caching: Frequently used images can be cached at encoder servers, eliminating redundant ViT computations and reducing network transfer overhead.
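The caching idea in the last item can be illustrated with a minimal sketch. This is not SGLang's implementation: the class name `EmbeddingCache`, the `fake_encoder` placeholder, the SHA-256 content key, and the LRU eviction policy are all assumptions chosen for illustration. The point is only that keying on image content lets a repeated image skip the encoder entirely.

```python
import hashlib
from collections import OrderedDict
from typing import Callable, List


class EmbeddingCache:
    """Toy LRU cache keyed by a hash of the raw image bytes.

    Hypothetical helper for illustration; not part of SGLang's API.
    """

    def __init__(self, encode_fn: Callable[[bytes], List[float]], capacity: int = 2):
        self.encode_fn = encode_fn  # stand-in for the ViT forward pass
        self.capacity = capacity
        self._store: "OrderedDict[str, List[float]]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, image_bytes: bytes) -> List[float]:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        embedding = self.encode_fn(image_bytes)
        self._store[key] = embedding
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the least recently used entry
        return embedding


def fake_encoder(image_bytes: bytes) -> List[float]:
    # Placeholder "embedding"; a real encoder would run the vision tower here.
    return [float(len(image_bytes)), float(sum(image_bytes) % 255)]


cache = EmbeddingCache(fake_encoder, capacity=2)
for img in (b"cat", b"dog", b"cat"):
    cache.get(img)
print(cache.hits, cache.misses)  # the second b"cat" lookup is served from the cache
```

A real deployment would also need to bound memory by embedding size rather than entry count, which this sketch ignores.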

EPD is most effective in image-heavy scenarios (e.g., multi-image inputs), where visual encoding is the primary computational bottleneck. In these scenarios, EPD significantly reduces request TTFT under load, achieving approximately 6–8× lower TTFT than the colocation approach at 1 QPS. Conversely, in image-light scenarios, EPD may be less efficient or even counterproductive: the additional network latency incurred by transmitting embeddings across nodes can outweigh the time saved by offloading the encoding task, potentially resulting in a higher TTFT than colocation.
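The trade-off between offloaded encoding and transfer overhead can be made concrete with a deliberately simplified cost model. Everything here is an assumption for illustration: the function names, the serial-encoding model for colocation, the per-image transfer cost, and the numeric values are hypothetical and are not measurements from the benchmark. Queueing under load, which drives the reported gains, is not modeled.

```python
import math


def ttft_colocated(n_images: int, t_encode: float, t_prefill: float) -> float:
    """Images are encoded one after another on the same GPU that runs prefill."""
    return n_images * t_encode + t_prefill


def ttft_epd(
    n_images: int,
    t_encode: float,
    t_prefill: float,
    n_encoders: int,
    t_transfer_per_image: float,
) -> float:
    """Images are spread across encoder servers, then embeddings are sent to prefill."""
    rounds = math.ceil(n_images / n_encoders)  # encoding waves across the encoder pool
    return rounds * t_encode + n_images * t_transfer_per_image + t_prefill


# Illustrative values in seconds (assumed, not measured).
T_ENCODE, T_PREFILL, N_ENCODERS, T_TRANSFER = 0.10, 0.20, 4, 0.03

for n in (1, 8):
    colo = ttft_colocated(n, T_ENCODE, T_PREFILL)
    epd = ttft_epd(n, T_ENCODE, T_PREFILL, N_ENCODERS, T_TRANSFER)
    winner = "EPD" if epd < colo else "colocated"
    print(f"{n} image(s): colocated={colo:.2f}s, EPD={epd:.2f}s -> {winner} is faster")
```

Under these assumed numbers, a single-image request pays the transfer cost without gaining any parallelism, while an eight-image request benefits from spreading the encoding across servers.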

Introduction

Vision-Language Models (VLMs) like Qwen2.5-VL and Llama-Vision combine visual understanding with language generation. However, these models face unique scaling challenges:

Source: LMSYS Blog (Chatbot Arena team), lmsys.org
