Advanced Quantization Algorithm for Large Language Models

2026-05-02 03:07 · 62 days ago · lastdong

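AI Summary

Intel has open-sourced AutoRound, an advanced quantization algorithm for large language models. Using an improved quantization strategy, it substantially reduces storage and compute requirements while preserving model performance, and it supports compressing model weights down to 3 or 4 bits. Compared with conventional methods, it achieves higher accuracy on several benchmarks and is particularly suited to resource-constrained deployments. The project code has been released on GitHub and has drawn attention from the developer community.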

Original text · untranslated

🚀 What is AutoRound?

AutoRound is an advanced quantization toolkit designed for Large Language Models (LLMs) and Vision-Language Models (VLMs). It achieves high accuracy at ultra-low bit widths (2–4 bits) with minimal tuning by leveraging sign-gradient descent, and it offers broad hardware compatibility. See our papers SignRoundV1 and SignRoundV2 for more details. For usage instructions, please refer to the User Guide.
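To make the idea of sign-gradient descent on rounding more concrete, here is a toy sketch in plain Python. It is not AutoRound's implementation: it tunes only a per-weight rounding offset `v` in [-0.5, 0.5] for a single weight vector, uses a straight-through estimator for the rounding step, and keeps the best offsets seen so far. The function names, the learning rate, the symmetric 4-bit grid, and the random calibration data are all assumptions made for illustration; the SignRound papers describe the actual method.

```python
import random

def fake_quantize(w, v, scale, qmin, qmax):
    """Quantize-dequantize: scale * clip(round(w / scale + v), qmin, qmax)."""
    return [scale * max(qmin, min(qmax, round(wi / scale + vi)))
            for wi, vi in zip(w, v)]

def output_mse(w_ref, w_q, xs):
    """Mean squared difference between the outputs x.w_q and x.w_ref."""
    total = 0.0
    for x in xs:
        diff = sum((q - r) * xi for q, r, xi in zip(w_q, w_ref, x))
        total += diff * diff
    return total / len(xs)

def sign_round(w, xs, bits=4, iters=200, lr=0.005):
    """Tune rounding offsets with signed gradient steps (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -qmax - 1
    scale = max(abs(wi) for wi in w) / qmax
    v = [0.0] * len(w)  # v = 0 is plain round-to-nearest
    rtn_loss = output_mse(w, fake_quantize(w, v, scale, qmin, qmax), xs)
    best_v, best_loss = v[:], rtn_loss
    for _ in range(iters):
        w_q = fake_quantize(w, v, scale, qmin, qmax)
        # Gradient of the output MSE with respect to the quantized weights.
        grad = [0.0] * len(w)
        for x in xs:
            diff = sum((q - r) * xi for q, r, xi in zip(w_q, w, x))
            for j, xj in enumerate(x):
                grad[j] += 2.0 * diff * xj / len(xs)
        # Straight-through estimator: d(w_q)/d(v) is treated as scale > 0,
        # so the sign of the gradient with respect to v equals sign(grad).
        v = [max(-0.5, min(0.5, vj - lr * ((gj > 0) - (gj < 0))))
             for vj, gj in zip(v, grad)]
        loss = output_mse(w, fake_quantize(w, v, scale, qmin, qmax), xs)
        if loss < best_loss:  # keep the best offsets seen so far
            best_v, best_loss = v[:], loss
    return best_v, rtn_loss, best_loss

# Hypothetical calibration data: 32 random inputs for a 16-weight vector.
random.seed(0)
w = [random.gauss(0.0, 1.0) for _ in range(16)]
xs = [[random.gauss(0.0, 1.0) for _ in range(16)] for _ in range(32)]
best_v, rtn_loss, tuned_loss = sign_round(w, xs)
print(f"RTN loss {rtn_loss:.5f} -> tuned loss {tuned_loss:.5f}")
```

Because only the sign of the gradient is used, every offset moves by the same fixed step per iteration, which keeps the update insensitive to gradient magnitude. Tracking the best offsets means the returned loss can never be worse than the round-to-nearest starting point.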

🆕 What's New

[2026/05] We provide free devices for calibration-free quantization via pure RTN mode; please visit Intel Low Bit Open LLM Leaderboard for more details.

[2026/05] Model-free quantization is available; auto-round-rtn now defaults to the model-free approach: Doc.

[2026/03] Block-wise FP8 quantization is available, and RTN mode is recommended: auto-round-rtn --scheme FP8_BLOCK.

[2026/03] MTP layer quantization is now supported; see this PR.

[2025/12] The SignRoundV2 paper is available. Turn on enable_alg_ext and use the AutoScheme API for mixed-precision quantization to reproduce the results: Paper, Notes for evaluating LLaMA models.

[2025/11] AutoRound has landed in LLM-Compressor: Usage, vLLM blog, RedHat blog, X post, Intel blog, Linkedin, WeChat, Zhihu.

[2025/11] An enhanced GGUF quantization algorithm is available via --enable_alg_ext: Accuracy.

[2025/10] AutoRound has been integrated into SGLang: Usage, LMSYS Blog, X post, Intel blog, Linkedin.

    # CPU(Xeon)/GPU(CUDA)
    pip install .

    # HPU(Gaudi)
    python setup.py install hpu

    # XPU(Intel GPU)
    pip install torch --index-url https://download.pytorch.org/whl/xpu
    pip install .

    # Best accuracy, 3X slower, low_gpu_mem_usage could save ~20G but ~30% slower
    auto-round-best \
        --model Qwen/Qwen3-0.6B \
        --scheme "W4A16" \
        --low_gpu_mem_usage

    # 2-3X speedup, slight accuracy drop at W4 and larger accuracy drop at W2
    auto-round-light \
        --model Qwen/Qwen3-0.6B \
        --scheme "W4A16"

    # Optimized RTN (iters=0, opt_rtn enabled); fast baseline
    auto-round-opt-rtn \
        --model Qwen/Qwen3-0.6B \
        --scheme "W4A16"

    # Pure RTN (iters=0, no AutoRound optimization); fastest, lowest memory
    # auto-routes to model-free mode for supported INT WOQ schemes
    auto-round-rtn \
        --model Qwen/Qwen3-0.6B \
        --scheme "W4A16"
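For readers unfamiliar with the RTN baseline used by the last command, the following plain-Python sketch shows what group-wise round-to-nearest weight quantization can look like. It is an illustration under stated assumptions, not the code that auto-round-rtn runs: the asymmetric min/scale layout, the function names, the group size of 128, and the 16-bit scale and minimum per group are all hypothetical choices. W4A16 is read here in the usual sense of 4-bit weights with 16-bit activations.

```python
import random

def rtn_quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest over one group: returns (codes, scale, w_min)."""
    qmax = 2 ** bits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax or 1.0  # guard against a constant group
    codes = [max(0, min(qmax, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min

def dequantize_group(codes, scale, w_min):
    """Map integer codes back to approximate floating-point weights."""
    return [w_min + scale * q for q in codes]

def rtn_quantize(weights, group_size=128, bits=4):
    """Quantize a flat weight list group by group, with no calibration data."""
    return [rtn_quantize_group(weights[i:i + group_size], bits)
            for i in range(0, len(weights), group_size)]

def bits_per_weight(bits=4, group_size=128, meta_bits=32):
    """Average storage per weight, assuming a 16-bit scale and a 16-bit
    minimum per group (meta_bits = 32); these sizes are an assumption."""
    return bits + meta_bits / group_size

# Hypothetical demo: 32 random weights in groups of 8.
random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(32)]
groups = rtn_quantize(weights, group_size=8, bits=4)
restored = [x for g in groups for x in dequantize_group(*g)]
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max abs error: {max_err:.4f}")
```

Each weight lands on the nearest of 16 levels spanning its group's range, so the reconstruction error per weight is at most half a step. No gradients or calibration samples are involved, which is why this mode is the fastest and lightest of the four, at the cost of accuracy relative to the tuned recipes.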
