AutoRound is an advanced quantization toolkit designed for Large Language Models (LLMs) and Vision-Language Models (VLMs). It achieves high accuracy at ultra-low bit widths (2–4 bits) with minimal tuning by leveraging sign-gradient descent and providing broad hardware compatibility. See our papers SignRoundV1 and SignRoundV2 for more details. For usage instructions, please refer to the User Guide.
🆕 What's New
[2026/05] We provide free devices for calibration-free quantization via pure RTN mode; please visit Intel Low Bit Open LLM Leaderboard for more details.
[2026/05] Model-free quantization is available; auto-round-rtn now defaults to the model-free approach: Doc.
[2026/03] Block-wise FP8 quantization is available, and RTN mode is recommended: auto-round-rtn --scheme FP8_BLOCK.
[2026/03] MTP layer quantization is now supported in this PR.
[2025/12] The SignRoundV2 paper is available. Turn on enable_alg_ext and use the AutoScheme API for mixed-precision quantization to reproduce the results: Paper, Notes for evaluating LLaMA models.
[2025/11] AutoRound has landed in LLM-Compressor: Usage, vLLM blog, RedHat blog, X post, Intel blog, Linkedin, 微信, 知乎.
[2025/11] An enhanced GGUF quantization algorithm is available via --enable_alg_ext: Accuracy.
[2025/10] AutoRound has been integrated into SGLang: Usage, LMSYS Blog, X post, Intel blog, Linkedin.
[2025/10] A mixed precision algorithm is available to generate schemes in minutes: Usage, Accuracy.
[2025/09] MXFP4 and NVFP4 dtypes are available: Accuracy.
[2025/08] An improved INT2 algorithm is available via --enable_alg_ext: Accuracy
[2025/07] GGUF format is supported: Usage.
[2025/05] AutoRound has been integrated into vLLM: Usage, Medium blog, 小红书.
[2025/05] AutoRound has been integrated into Transformers: Blog.
[2025/03] The INT2-mixed DeepSeek-R1 model (~200GB) retains 97.9% accuracy: Model.
✨ Key Features
✅ Superior Accuracy: Delivers strong performance even at 2–3 bits (example models), with leading results at 4 bits (benchmark).
✅ Ecosystem Integration: Works seamlessly with Transformers, vLLM, SGLang, and more.
✅ Multiple Export Formats: Supports AutoRound, AutoAWQ, AutoGPTQ, and GGUF for maximum compatibility. Details are shown in export formats.
✅ Fast Mixed Bits/Dtypes Scheme Generation: Automatically configures a scheme in minutes, with about 1.1X-1.5X the model’s BF16 RAM size as overhead. See the accuracy results and user guide.
✅ Optimized Round-to-Nearest Mode: Use --iters 0 for fast quantization with some accuracy drop at 4 bits. Details are shown in opt_rtn mode.
✅ Affordable Quantization Cost: Quantize 7B models in about 10 minutes on a single GPU. Details are shown in quantization costs.
✅ 10+ VLMs Support: Out-of-the-box quantization for 10+ vision-language models (example models, support matrix).
✅ Multiple Recipes: Choose from auto-round-best, auto-round, auto-round-light, auto-round-opt-rtn (optimized RTN), and auto-round-rtn (pure RTN, fastest baseline) to suit your needs. Details are shown in quantization recipes.
✅ Advanced Utilities: Includes multi-GPU quantization, multiple calibration datasets, and support for 10+ runtime backends.
✅ Beyond Weight-Only Quantization: We are actively expanding support for additional datatypes such as MXFP, NVFP, W8A8, and more.
If you encounter issues during quantization, try using pure RTN mode with iters=0, disable_opt_rtn=True. Additionally, using group_size=32 or mixed bits is recommended for better results.
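To make the role of group_size concrete, the sketch below implements a generic asymmetric round-to-nearest (RTN) quantizer in pure Python and compares the reconstruction error for two group sizes. It illustrates group-wise RTN in general and is not AutoRound's implementation; the names rtn_quantize_group and rtn_error and the random test weights are assumptions made for this example.

```python
import random


def rtn_quantize_group(values, bits=4):
    """Asymmetric round-to-nearest for one group; returns dequantized values."""
    qmax = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    out = []
    for v in values:
        q = round((v - lo) / scale)      # nearest integer level
        q = max(0, min(qmax, q))         # clamp to the representable range
        out.append(q * scale + lo)       # map back to floating point
    return out


def rtn_error(weights, bits=4, group_size=128):
    """Mean squared error after quantizing consecutive groups independently."""
    total = 0.0
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        deq = rtn_quantize_group(group, bits)
        total += sum((w - d) ** 2 for w, d in zip(group, deq))
    return total / len(weights)


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1024)]  # hypothetical weights
err_g128 = rtn_error(weights, bits=4, group_size=128)
err_g32 = rtn_error(weights, bits=4, group_size=32)
print(f"MSE with group_size=128: {err_g128:.6f}")
print(f"MSE with group_size=32:  {err_g32:.6f}")
```

Smaller groups span a narrower value range, so each group gets a finer scale. The cost is more scales and zero-points to store per weight.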
CLI Usage
The full list of supported arguments is provided by calling auto-round -h on the terminal.
ModelScope is supported for model downloads; simply set AR_USE_MODELSCOPE=1.
We offer two other recipes, auto-round-best and auto-round-light, designed for optimal accuracy and improved speed, respectively. Details are as follows.
```bash
# Best accuracy, 3X slower; low_gpu_mem_usage could save ~20G but is ~30% slower
auto-round-best \
    --model Qwen/Qwen3-0.6B \
    --scheme "W4A16" \
    --low_gpu_mem_usage
```

```bash
# 2-3X speedup, slight accuracy drop at W4 and larger accuracy drop at W2
auto-round-light \
    --model Qwen/Qwen3-0.6B \
    --scheme "W4A16"
```

```bash
# Pure RTN (iters=0, no AutoRound optimization); fastest, lowest memory
# auto-routes to model-free mode for supported INT WOQ schemes
auto-round-rtn \
    --model Qwen/Qwen3-0.6B \
    --scheme "W4A16"
```
In conclusion, we recommend using auto-round for W4A16 and auto-round-best with enable_alg_ext for W2A16. However, you may adjust the configuration to suit your specific requirements and available resources.
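The recipes differ mainly in tuning budget. As a rough sketch, the table below pairs each recipe name with the arguments that appear in the API Usage comments and parameter defaults of this README. The pairing itself, the RECIPES table, and the resolve_recipe helper are assumptions made for illustration; they are not read from the AutoRound source, so check `auto-round -h` for the authoritative settings.

```python
# Hypothetical presets: the values come from this README's API comments and
# documented defaults; mapping them onto CLI recipe names is an assumption.
RECIPES = {
    "auto-round-best": {"nsamples": 512, "iters": 1000},
    "auto-round": {"nsamples": 128, "iters": 200},
    "auto-round-light": {"nsamples": 128, "iters": 50, "lr": 5e-3},
    "auto-round-opt-rtn": {"iters": 0},
    "auto-round-rtn": {"iters": 0, "disable_opt_rtn": True},
}


def resolve_recipe(name):
    """Return a copy of the preset with lr filled in as 1.0/iters when unset."""
    cfg = dict(RECIPES[name])
    iters = cfg["iters"]
    if cfg.get("lr") is None and iters > 0:
        cfg["lr"] = 1.0 / iters          # documented default when lr is None
    return cfg


for recipe_name in RECIPES:
    print(recipe_name, resolve_recipe(recipe_name))
```

Under these assumptions the RTN recipes carry no learning rate at all, since there are no tuning iterations to apply it to.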
API Usage
```python
from auto_round import AutoRound

# Load a model (supports FP8/BF16/FP16/FP32)
model_name_or_path = "Qwen/Qwen3-0.6B"

# Available schemes: "W2A16", "W3A16", "W4A16", "W8A16", "NVFP4", "MXFP4" (no real kernels), "GGUF:Q4_K_M", etc.
ar = AutoRound(model_name_or_path, scheme="W4A16")

# Highest accuracy (4–5× slower).
# low_gpu_mem_usage=True saves ~20GB VRAM but runs ~30% slower.
# ar = AutoRound(model_name_or_path, nsamples=512, iters=1000, low_gpu_mem_usage=True)

# Faster quantization (2–3× speedup) with slight accuracy drop at W4G128.
# ar = AutoRound(model_name_or_path, nsamples=128, iters=50, lr=5e-3)

# Supported formats: "auto_round" (default), "auto_gptq", "auto_awq", "llm_compressor", "gguf:q4_k_m", etc.
ar.quantize_and_save(output_dir="./qmodel", format="auto_round")
```
scheme (str|dict|AutoScheme): The predefined quantization keys, e.g. W4A16, MXFP4, NVFP4, GGUF:Q4_K_M. For MXFP4/NVFP4, we recommend exporting to LLM-Compressor format.
bits (int): Number of bits for quantization (default is None). If not None, it overrides the scheme setting.
group_size (int): Size of the quantization group (default is None). If not None, it overrides the scheme setting.
sym (bool): Whether to use symmetric quantization (default is None). If not None, it overrides the scheme setting.
layer_config (dict): Configuration for layer-wise schemes (default is None), mainly for customized mixed schemes.
enable_alg_ext (bool): [Experimental Feature] Only for iters>0. Enables algorithm variants for specific schemes (e.g., MXFP4/W2A16) that could bring notable improvements. Default is False.
disable_opt_rtn (bool|None): Use pure RTN mode for specific schemes (e.g., GGUF and WOQ). Default is None. If None, it defaults to False in most cases to improve accuracy, but may be set to True due to known issues.
iters (int): Number of tuning iterations (default is 200). Common values: 0 (RTN mode), 50 (with lr=5e-3 recommended), 1000. Higher values increase accuracy but slow down tuning.
lr (float): The learning rate for the rounding values (default is None). When None, it is set to 1.0/iters automatically.
batch_size (int): Batch size for training (default is 8). 4 is also commonly used.
enable_deterministic_algorithms (bool): Whether to enable deterministic algorithms for reproducibility (default is False).
dataset (str|list|tuple|torch.utils.data.DataLoader): The dataset for tuning (default is "NeelNanda/pile-10k"). Supports local JSON files and dataset combinations, e.g. "./tmp.json,NeelNanda/pile-10k:train,mbpp:train+validation+test".
nsamples (int): Number of samples for tuning (default is 128).
seqlen (int): Sequence length of the tuning data (default is 2048).
enable_torch_compile (bool): We typically recommend setting it to True for faster quantization with lower resource usage, provided no exception is raised.
low_gpu_mem_usage (bool): Whether to offload intermediate features to CPU at the cost of ~20% more tuning time (default is False).
low_cpu_mem_usage (bool): [Experimental Feature] Whether to save immediately to reduce RAM usage (default is True).
device_map (str|dict|int): The device to be used for tuning, e.g., auto, cpu, cuda, 0,1,2 (default is 0). When set to auto, it tries to use all available GPUs.
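To illustrate how iters and lr interact, here is a toy sketch of sign-gradient descent on rounding offsets, the general idea named in the introduction. It is a heavily simplified pure-Python illustration and not the AutoRound algorithm itself: it assumes a single weight row, a fixed symmetric scale, a straight-through gradient, and random calibration inputs, and the names quantize, residuals, and tune_rounding are hypothetical.

```python
import random


def quantize(w, v, scale, qmax):
    """Dequantize weights rounded with a per-weight offset v (v = 0 is plain RTN)."""
    return [scale * max(-qmax, min(qmax, round(wi / scale + vi)))
            for wi, vi in zip(w, v)]


def residuals(w, wq, xs):
    """Per-sample difference between quantized and full-precision outputs."""
    return [sum(x * (q - f) for x, q, f in zip(row, wq, w)) for row in xs]


def tune_rounding(w, xs, bits=4, iters=200, lr=None):
    """Return (rtn_loss, best_loss) after sign-gradient tuning of the offsets."""
    if lr is None:
        lr = 1.0 / iters if iters > 0 else 0.0   # mirrors the documented lr default
    qmax = (1 << (bits - 1)) - 1
    scale = max(abs(wi) for wi in w) / qmax
    v = [0.0] * len(w)
    rtn_loss = sum(r * r for r in residuals(w, quantize(w, v, scale, qmax), xs))
    best_loss = rtn_loss
    for _ in range(iters):
        res = residuals(w, quantize(w, v, scale, qmax), xs)
        for j in range(len(w)):
            # Straight-through estimate: d(wq_j)/d(v_j) is treated as `scale`.
            grad = 2.0 * scale * sum(r * row[j] for r, row in zip(res, xs))
            step = lr if grad > 0 else (-lr if grad < 0 else 0.0)
            v[j] = max(-0.5, min(0.5, v[j] - step))   # only the sign of grad is used
        loss = sum(r * r for r in residuals(w, quantize(w, v, scale, qmax), xs))
        best_loss = min(best_loss, loss)              # keep the best iterate
    return rtn_loss, best_loss


random.seed(0)
w = [random.gauss(0.0, 1.0) for _ in range(16)]                        # toy weight row
xs = [[random.gauss(0.0, 1.0) for _ in range(16)] for _ in range(32)]  # toy inputs
rtn_loss, tuned_loss = tune_rounding(w, xs, bits=4, iters=200)
print(f"RTN loss: {rtn_loss:.4f}, tuned loss: {tuned_loss:.4f}")
```

With lr = 1.0/iters, the total distance an offset can travel over the full budget is 1.0, which is exactly the width of the [-0.5, 0.5] range it is clipped to in this sketch.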
AutoScheme provides an automatic algorithm to generate adaptive mixed bits/data-type quantization recipes. Please refer to the user guide for more details on AutoScheme.
```python
from auto_round import AutoRound, AutoScheme

model_name = "Qwen/Qwen3-8B"
avg_bits = 3.0
scheme = AutoScheme(avg_bits=avg_bits, options=("GGUF:Q2_K_S", "GGUF:Q4_K_S"), ignore_scale_zp_bits=True)
layer_config = {"lm_head": "GGUF:Q6_K"}

# Change iters to 200 for non-GGUF schemes
ar = AutoRound(model=model_name, scheme=scheme, layer_config=layer_config, iters=0)
ar.quantize_and_save()
```
avg_bits (float): Target average bit-width for the entire model. Only quantized layers are included in the average bit calculation.
options (str | list[str] | list[QuantizationScheme]): Candidate quantization schemes to choose from. It can be a single comma-separated string (e.g., "W4A16,W2A16"), a list of strings (e.g., ["W4A16", "W2A16"]), or a list of QuantizationScheme objects.
ignore_scale_zp_bits (bool): Only supported in API usage. Determines whether to exclude the bits of scale and zero-point from the average bit-width calculation (default: False).
shared_layers (Iterable[Iterable[str]], optional): Only supported in API usage. Defines groups of layers that share quantization settings.
batch_size (int, optional): Only supported in API usage. Can be set to 1 to reduce VRAM usage at the expense of longer tuning time.
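The effect of ignore_scale_zp_bits can be seen with a little arithmetic. The sketch below computes a parameter-weighted average bit-width for a hypothetical two-layer mix, with and without per-group scale and zero-point overhead. The layer sizes, the 16-bit scale, the zero-point stored at the weight bit-width, and the helper names layer_bits and average_bits are all assumptions for illustration; AutoScheme's actual accounting may differ.

```python
def layer_bits(bits, group_size, count_overhead, scale_bits=16, zp_bits=None):
    """Effective bits per weight for one layer under the stated assumptions."""
    if not count_overhead:
        return float(bits)
    zp_bits = bits if zp_bits is None else zp_bits
    # One scale and one zero-point are shared by every group_size weights.
    return bits + (scale_bits + zp_bits) / group_size


def average_bits(layers, count_overhead):
    """Parameter-weighted average over (num_params, bits, group_size) tuples."""
    total_params = sum(n for n, _, _ in layers)
    total_bits = sum(n * layer_bits(b, g, count_overhead) for n, b, g in layers)
    return total_bits / total_params


# Hypothetical mix: half the parameters at 4 bits, half at 2 bits, group size 128.
layers = [(4_000_000, 4, 128), (4_000_000, 2, 128)]
avg_ignoring = average_bits(layers, count_overhead=False)
avg_counting = average_bits(layers, count_overhead=True)
print(f"average bits, scale/zp ignored: {avg_ignoring}")
print(f"average bits, scale/zp counted: {avg_counting}")
```

In this toy setup the same mix lands on the target only when the overhead is ignored, so counting it would leave less room for higher-precision layers under a fixed avg_bits.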
API Usage for VLMs
This feature is experimental and may be subject to changes.
By default, AutoRound quantizes only the text module of VLMs and uses NeelNanda/pile-10k for calibration. To quantize the entire model, set quant_nontext_module to True, though support for this feature is limited. For more information, please refer to the AutoRound readme.
```python
from auto_round import AutoRound

# Load the model
model_name_or_path = "Qwen/Qwen2.5-VL-7B-Instruct"

# Quantize the model
ar = AutoRound(model_name_or_path, scheme="W4A16")
output_dir = "./qmodel"
ar.quantize_and_save(output_dir)
```
Model Inference
vLLM (CPU/Intel GPU/CUDA)
```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95)
model_name = "Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound"
llm = LLM(model=model_name)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```