AutoRound and SGLang Officially Integrate for Efficient Inference of Low-Bit Quantized Models

2025-11-14 00:00

AI Summary

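AutoRound and SGLang have announced a collaboration that supports efficient inference deployment of low-bit quantized models from INT2 to INT8. Built on a signed-gradient optimization algorithm, AutoRound achieves up to 2.1x higher relative accuracy than mainstream baselines at INT2 precision, and quantizing a 72B model on a single GPU takes only 37 minutes. Developers can deploy quantized models in GPTQ, AWQ, or GGUF format directly to SGLang v0.5.4.post2+, with support for LLM, VLM, and MoE architectures, significantly reducing inference latency with minimal accuracy loss.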


Contents

Overview

What Is AutoRound?

AutoRound Highlights

Integration Overview

Quantize with AutoRound

1.1 API Usage

1.2 CMD Usage

Deploying with SGLang

2.1 OpenAI-Compatible Inference Usage

2.2 Offline Engine API Inference Usage

Quantization Roadmap

Conclusion

🚀 AutoRound Meets SGLang: Enabling Quantized Model Inference with AutoRound

Overview

We are thrilled to announce an official collaboration between SGLang and AutoRound, enabling low-bit quantization for efficient LLM inference.

Through this integration, developers can now quantize large models with AutoRound’s signed-gradient optimization and directly deploy them in SGLang’s efficient runtime, achieving low-bit model inference with minimal accuracy loss and significant latency reduction.

What Is AutoRound?

AutoRound is an advanced post-training quantization (PTQ) toolkit designed for Large Language Models (LLMs) and Vision-Language Models (VLMs). It uses signed gradient descent to jointly optimize weight rounding and clipping ranges, enabling accurate low-bit quantization (e.g., INT2 to INT8) with minimal accuracy loss in most scenarios. For example, at INT2 precision, it achieves up to 2.1x higher relative accuracy than popular baselines. At INT4 precision, AutoRound continues to hold a competitive edge in most cases. The image below provides an overview of the core algorithm in AutoRound.

Full technical details are presented in the AutoRound paper:

👉 Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs

[Figure: AutoRound algorithm overview]
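To make the idea of jointly tuning rounding and clipping with signed gradient descent more concrete, here is a toy NumPy sketch for a single linear layer. It is not AutoRound's actual implementation: the function names `quantize` and `signsgd_round`, the per-row symmetric scheme, the rounding offset `V` in [-0.5, 0.5], the clipping factor `alpha` in [0.5, 1], the step size, and the use of a straight-through estimator for the rounding step are all assumptions chosen for illustration.

```python
import numpy as np

def quantize(W, V, alpha, bits=4):
    """Fake-quantize W row-wise using rounding offsets V and clipping factors alpha."""
    qmax = 2 ** (bits - 1) - 1
    # Per-row scale; alpha < 1 shrinks the clipping range.
    absmax = np.maximum(np.abs(W).max(axis=1, keepdims=True), 1e-8)
    scale = alpha * absmax / qmax
    r = np.round(W / scale + V)
    unclipped = (r >= -qmax - 1) & (r <= qmax)
    q = np.clip(r, -qmax - 1, qmax)
    return q * scale, q, scale, unclipped

def signsgd_round(W, X, bits=4, steps=200):
    """Tune V and alpha with sign-of-gradient updates to match the layer output X @ W.T."""
    lr = 1.0 / steps                      # illustrative step size
    V = np.zeros_like(W)                  # V = 0, alpha = 1 is plain round-to-nearest
    alpha = np.ones((W.shape[0], 1))
    Y = X @ W.T                           # full-precision reference output
    best_loss, best_W = np.inf, None
    for _ in range(steps):
        W_hat, q, scale, unclipped = quantize(W, V, alpha, bits)
        err = X @ W_hat.T - Y
        loss = float(np.mean(err ** 2))
        if loss < best_loss:              # keep the best iterate seen so far
            best_loss, best_W = loss, W_hat.copy()
        g_W = err.T @ X                   # dLoss/dW_hat up to a positive constant
        # Straight-through estimator: treat round() as identity in the backward pass.
        g_V = g_W * scale * unclipped
        dW_dalpha = np.where(unclipped, q - W / scale, q) * scale / alpha
        g_alpha = (g_W * dW_dalpha).sum(axis=1, keepdims=True)
        # Signed gradient descent: only the sign of each gradient entry is used.
        V = np.clip(V - lr * np.sign(g_V), -0.5, 0.5)
        alpha = np.clip(alpha - lr * np.sign(g_alpha), 0.5, 1.0)
    return best_W, best_loss

# Compare against round-to-nearest (RTN) on random data.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 32))
X = rng.normal(size=(256, 32))
rtn_W, _, _, _ = quantize(W, np.zeros_like(W), np.ones((8, 1)))
rtn_loss = float(np.mean((X @ rtn_W.T - X @ W.T) ** 2))
opt_W, opt_loss = signsgd_round(W, X)
print(f"RTN loss: {rtn_loss:.6f}  tuned loss: {opt_loss:.6f}")
```

Because the first iteration evaluates the round-to-nearest starting point and the best iterate is kept, the tuned loss in this sketch can never be worse than the RTN loss on the calibration data. How much it improves depends on the data and bit width.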

LMSYS Blog (Chatbot Arena team)


Read the original: lmsys.org
