在 SGLang 中支持新 VLMs：NVILA 案例研究

2025-07-16 00:00·352天前

AI 摘要

NVILA 团队发布技术博客，详解如何在 SGLang 推理框架中集成新型视觉语言模型。文章以 NVILA 为实践案例，提供从模型适配、推理优化到部署的完整开发指南与代码实践。随着多模态大模型成为行业焦点，该方案填补了 SGLang 生态在视觉理解模型支持方面的文档空白，为开发者快速接入新 VLM 提供了标准化技术路径与最佳实践。

原文 · 未翻译

Contents

Accelerating the NVILA Visual Language Model with SGLang

The Big Picture: How VLMs like NVILA Work

Supporting New Models in SGLang

Step 1: Register the Model as Multimodal

Step 2: Register a New Chat Template

Understanding the ChatML Template for NVILA

Step 3: Building the Multimodal Data Processor

The Processor’s Skeleton

From Raw Input to Processed Data

Step 4: Create the Core Model Definition

Adapting Attention Mechanisms

Handling Multimodal Inputs with padinputids

Handling Image Features

Defining the forward pass

Implementing loadweights

Step 5: Add Integration Tests

Conclusion

Acknowledgements

How to support new VLMs into SGLang: A Case Study with NVILA

The world of LLMs is evolving at a remarkable pace, with Visual Language Models (VLMs) at the forefront of this revolution. These models power applications that can understand and reason about both images and text. There are tons of new VLM models emerging daily, and we want to integrate them into SGLang to leverage its high-speed throughput. Today, we’ll provide a step-by-step walkthrough for integrating new VLMs into the SGLang ecosystem, using the recent NVILA model as a real-world case study.

Accelerating the NVILA Visual Language Model with SGLang

The benchmarks below compare the original VILA implementation against SGLang with different levels of concurrency

In real world VLM development, we focus on two important metrics to evaluate a serving systems’ performance: Throughput (Token per Second, TPS) and and Time to First Token (TTFT).

For TPS, higher throughput means the system can generate more tokens simultaneously . SGLang's RadixAttention allows for efficient batching of requests, dramatically increasing the number of tokens generated per second. With a concurrency of 8, SGLang achieves over 4.4x higher throughput.

RadixAttention

For TTFT, a lower value means users get a faster response to receive the first token. SGLang's memory optimizations and efficient kernel implementations significantly reduce prefill latency. The benchmark shows SGLang responses up to 2.2x faster when concurrency is 8.

These performance gains make SGLang an excellent choice for deploying demanding VLMs like NVILA in production environments, and now can be easily deployed for sglang version ≥ 0.4.8

LMSYS：Blog（Chatbot Arena 团队）

导出 Markdown

在 SGLang 中支持新 VLMs：NVILA 案例研究

2025-07-16 00:00·352天前

阅读原文· lmsys.org

AI 摘要

原文 · 保持原样，未翻译

Contents

Accelerating the NVILA Visual Language Model with SGLang

The Big Picture: How VLMs like NVILA Work

Supporting New Models in SGLang

Step 1: Register the Model as Multimodal

Step 2: Register a New Chat Template

Understanding the ChatML Template for NVILA

Step 3: Building the Multimodal Data Processor

The Processor’s Skeleton

From Raw Input to Processed Data

Step 4: Create the Core Model Definition

Adapting Attention Mechanisms

Handling Multimodal Inputs with padinputids

Handling Image Features

Defining the forward pass

在 SGLang 中支持新 VLMs：NVILA 案例研究

在 SGLang 中支持新 VLMs：NVILA 案例研究

python/sglang/srt/configs/model_config.py # ... existing code ... multimodal_model_archs = [ # ... existing code ... "VILAForConditionalGeneration", ] # ... existing code ...

python/sglang/srt/configs/model_config.py # ... existing code ... multimodal_model_archs = [ # ... existing code ... "VILAForConditionalGeneration", ] # ... existing code ...

python/sglang/srt/conversation.py # ... existing code ... @register_conv_template_matching_function def match_vila(model_path: str): # ... existing code ... if re.search(r"vila", model_path, re.IGNORECASE): return "chatml"

python/sglang/srt/conversation.py # ... existing code ... @register_conv_template_matching_function def match_vila(model_path: str): # ... existing code ... if re.search(r"vila", model_path, re.IGNORECASE): return "chatml"

python/sglang/srt/conversation.py # ... existing code ... register_conv_template( Conversation( name="vila", system_template="system\\n{system_message}", system_message="You are a helpful assistant.", roles=("user", "assistant"), sep_style=SeparatorStyle.CHATML, sep="", stop_str=["", ""], ) )

python/sglang/srt/conversation.py # ... existing code ... register_conv_template( Conversation( name="vila", system_template="system\\n{system_message}", system_message="You are a helpful assistant.", roles=("user", "assistant"), sep_style=SeparatorStyle.CHATML, sep="", stop_str=["", ""], ) )

python/sglang/srt/configs/model_config.py # ... existing code ... multimodal_model_archs = [ # ... existing code ... "VILAForConditionalGeneration", ] # ... existing code ...

python/sglang/srt/configs/model_config.py # ... existing code ... multimodal_model_archs = [ # ... existing code ... "VILAForConditionalGeneration", ] # ... existing code ...

python/sglang/srt/conversation.py # ... existing code ... @register_conv_template_matching_function def match_vila(model_path: str): # ... existing code ... if re.search(r"vila", model_path, re.IGNORECASE): return "chatml"

python/sglang/srt/conversation.py # ... existing code ... @register_conv_template_matching_function def match_vila(model_path: str): # ... existing code ... if re.search(r"vila", model_path, re.IGNORECASE): return "chatml"

python/sglang/srt/conversation.py # ... existing code ... register_conv_template( Conversation( name="vila", system_template="system\\n{system_message}", system_message="You are a helpful assistant.", roles=("user", "assistant"), sep_style=SeparatorStyle.CHATML, sep="", stop_str=["", ""], ) )

python/sglang/srt/conversation.py # ... existing code ... register_conv_template( Conversation( name="vila", system_template="system\\n{system_message}", system_message="You are a helpful assistant.", roles=("user", "assistant"), sep_style=SeparatorStyle.CHATML, sep="", stop_str=["", ""], ) )