加速 SGLang 推理：原生集成 NVIDIA Model Optimizer 实现无缝量化与部署（12月2日更新）

2025-12-02 00:00·213天前

AI 摘要

SGLang 最新版本原生集成 NVIDIA Model Optimizer，支持通过直接 API 调用实现模型量化与部署。新功能将原本复杂的多步骤流程简化为量化、导出、部署三步，支持 NVFP4、MXFP4、FP8 等低精度格式。与原始 FP8 基线相比，优化后的模型在 Blackwell 架构上可实现高达 2 倍的每 GPU 吞吐量提升，显著降低延迟与内存占用。

原文 · 未翻译

Contents

What’s New: Direct ModelOpt APIs in SGLang

Performance Outcomes

How to Get Started

Conclusion

Acknowledgement

Boost SGLang Inference: Native NVIDIA Model Optimizer Integration for Seamless Quantization and Deployment

(Updated on Dec 2)

We are thrilled to announce a major new feature in SGLang: native support for NVIDIA Model Optimizer quantization! This integration streamlines the entire model optimization and deployment process, allowing you to go from a full-precision model to a high-performance, quantized endpoint entirely within the SGLang ecosystem.

Serving large language models efficiently is one of the biggest challenges in production. Model quantization is a critical technique for reducing the memory footprint and increasing inference speed of a model. Prior to this feature the process required multi-step workflows and separate tools for model optimization and deployment.

With our latest updates (via PRs #7149, #9991, and #10154), we’ve eliminated that complexity.

The optimizations from Model Optimizer and SGLang can deliver up to 2x better per GPU throughput comparing NVFP4 and FP8 inference.

What’s New: Direct ModelOpt APIs in SGLang

SGLang now integrates NVIDIA's Model Optimizer directly, allowing you to call its powerful quantization APIs from your SGLang code.

This new capability unlocks a simple, three-step workflow:

Quantize: Use the new SGLang-ModelOpt interface to apply state-of-the-art quantization techniques that enable accelerated low-precision inference in NVFP4, MXFP4, FP8, etc.

LMSYS：Blog（Chatbot Arena 团队）

导出 Markdown

加速 SGLang 推理：原生集成 NVIDIA Model Optimizer 实现无缝量化与部署（12月2日更新）

2025-12-02 00:00·213天前

阅读原文· lmsys.org

AI 摘要

原文 · 保持原样，未翻译

Contents

What’s New: Direct ModelOpt APIs in SGLang

Performance Outcomes

How to Get Started

Conclusion

Acknowledgement

Boost SGLang Inference: Native NVIDIA Model Optimizer Integration for Seamless Quantization and Deployment

(Updated on Dec 2)

加速 SGLang 推理：原生集成 NVIDIA Model Optimizer 实现无缝量化与部署（12月2日更新）

加速 SGLang 推理：原生集成 NVIDIA Model Optimizer 实现无缝量化与部署（12月2日更新）

Deploy the exported quantized model python -m sglang.launch_server \ --model-path ./quantized_qwen3_8b_fp8 \ --quantization modelopt \ --port 30000 --host 0.0.0.0

Deploy the exported quantized model python -m sglang.launch_server \ --model-path ./quantized_qwen3_8b_fp8 \ --quantization modelopt \ --port 30000 --host 0.0.0.0

Deploy the exported quantized model python -m sglang.launch_server \ --model-path ./quantized_qwen3_8b_fp8 \ --quantization modelopt \ --port 30000 --host 0.0.0.0

Deploy the exported quantized model python -m sglang.launch_server \ --model-path ./quantized_qwen3_8b_fp8 \ --quantization modelopt \ --port 30000 --host 0.0.0.0