💡 KVarN delivers 3-5x more KV-cache capacity and up to ~1.3x the throughput of FP16, so you fit far longer contexts and serve more concurrent requests, with FP16-level accuracy.
💡 KVarN delivers 3-5x more KV-cache capacity and up to ~1.3x the throughput of FP16, so you fit far longer contexts and serve more concurrent requests, with FP16-level accuracy.
🔌 Calibration-free, plug-and-play with vLLM. A native vLLM attention backend: add one flag, no model changes, no calibration.
🔌 Calibration-free, plug-and-play with vLLM. A native vLLM attention backend: add one flag, no model changes, no calibration.
🥊 Up to ~2.4× TurboQuant throughput, same capacity, higher accuracy.
🥊 Up to ~2.4× TurboQuant throughput, same capacity, higher accuracy.
Why KVarN (Variance Normalized KV-Cache)?
kvarn /kvɑːɳ/ · noun (Swedish) A grinding apparatus used to reduce substances into smaller particles or powder, especially grains, seeds, spices, coffee beans, KV-caches.
kvarn /kvɑːɳ/ · noun (Swedish)
A grinding apparatus used to reduce substances into smaller particles or powder, especially grains, seeds, spices, coffee beans, KV-caches.
KV-cache quantization usually comes with a catch. As the vLLM TurboQuant blog shows, existing methods buy extra KV-cache capacity but give up throughput (TurboQuant reports 40 to 52% lower throughput for 2.3-3.7x capacity), and aggressive low-bit quantization also tends to cost accuracy. Losing both speed and quality is the main reason KV-cache quantization is rarely turned on in production.
KVarN is built to keep both. On Qwen3-32B (AIME25, 16K-context burst, TP=2) it matches FP16 accuracy and beats its throughput while delivering ~4× the KV-cache capacity:
KVarN stays in the upper-right corner the blog's methods can't reach: FP16-level accuracy, FP16-or-better throughput, and several times the context.
Quickstart
KVarN ships as a vLLM fork. Install it like vLLM, then select the KVarN KV-cache dtype.
1. Clone git clone https://github.com/huawei-csl/KVarN.git cd KVarN # 2. Install (uses the upstream precompiled wheel; KVarN kernels are Triton, JIT-compiled at runtime) VLLM_USE_PRECOMPILED=1 pip install -e .
from vllm import LLM, SamplingParams llm = LLM( model="Qwen/Qwen3-32B", dtype="float16", # KVarN runs in float16 kv_cache_dtype="kvarn_k4v2_g128", # enable KVarN block_size=128, # KVarN tile size ) print(llm.generate("Explain KV-cache quantization in one sentence.", SamplingParams(max_tokens=64))[0].outputs[0].text)
💡 KVarN delivers 3-5x more KV-cache capacity and up to ~1.3x the throughput of FP16, so you fit far longer contexts and serve more concurrent requests, with FP16-level accuracy.
💡 KVarN delivers 3-5x more KV-cache capacity and up to ~1.3x the throughput of FP16, so you fit far longer contexts and serve more concurrent requests, with FP16-level accuracy.
🔌 Calibration-free, plug-and-play with vLLM. A native vLLM attention backend: add one flag, no model changes, no calibration.
🔌 Calibration-free, plug-and-play with vLLM. A native vLLM attention backend: add one flag, no model changes, no calibration.
🥊 Up to ~2.4× TurboQuant throughput, same capacity, higher accuracy.
Note: KVarN runs in float16 compute. One vLLM block is one KVarN tile, so the tile / page size equals --block-size. Both 128 (default) and 64 are supported, selected by the matching preset (kvarn_k4v2_g128 / kvarn_k4v2_g64). 128 is the design point; 64 gives finer quantization granularity at the cost of a little KV capacity (more per-tile scale overhead per token), at essentially the same throughput.
Note: KVarN runs in float16 compute. One vLLM block is one KVarN tile, so the tile / page size equals --block-size. Both 128 (default) and 64 are supported, selected by the matching preset (kvarn_k4v2_g128 / kvarn_k4v2_g64). 128 is the design point; 64 gives finer quantization granularity at the cost of a little KV capacity (more per-tile scale overhead per token), at essentially the same throughput.
float16
--block-size
kvarn_k4v2_g128
kvarn_k4v2_g64
Tip (capacity): KVarN realizes its full KV-cache capacity when there is room to amortize a small fixed decode workspace. On multi-GPU or generous --gpu-memory-utilization setups this is automatic. On a tight single-GPU budget, vLLM's CUDA-graph memory profiler can over-reserve and shrink the KV pool; set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 (and/or raise --gpu-memory-utilization) to recover the full capacity.
Tip (capacity): KVarN realizes its full KV-cache capacity when there is room to amortize a small fixed decode workspace. On multi-GPU or generous --gpu-memory-utilization setups this is automatic. On a tight single-GPU budget, vLLM's CUDA-graph memory profiler can over-reserve and shrink the KV pool; set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 (and/or raise --gpu-memory-utilization) to recover the full capacity.
--gpu-memory-utilization
VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0
--gpu-memory-utilization
MLA models
KVarN also supports Multi-head Latent Attention (MLA). To the best of our knowledge, KVarN is the first vLLM-compatible sub-8-bit KV-cache quantization method to support MLA-based models: it quantizes the compressed KV latent to int4. Use the same --kv-cache-dtype you use on dense models; on an MLA model it automatically routes to the MLA latent path (no code or env changes), and the fast path is on by default:
The win is 2.77× KV capacity at ~parity accuracy. MLA's latent is already tiny so KVarN is not a latency play there, but it lets you fit far more concurrent context in the same memory.
Hybrid models (Mamba / linear-attention)
KVarN supports hybrid models that interleave linear-attention or Mamba layers with standard full-attention layers, such as Qwen3.6-27B. KVarN compresses only the full-attention layers (the layers that hold a KV cache); the linear-attention and Mamba layers keep their own recurrent state and are left untouched. The fp16 decode pool is sized from the full-attention layer count only, so a hybrid model loads with the default flags, no manual pool tuning.
The KV-cache capacity gain applies to the full-attention layers (where the cache lives), so a hybrid model fits far more concurrent context in the same memory at near parity accuracy.
Speculative decoding (MTP and draft models)
KVarN is compatible with speculative decoding, including Multi-Token Prediction (MTP). Pass --speculative-config exactly as you normally would, alongside the KVarN --kv-cache-dtype:
The speculative verify step attends over the full cached context (KVarN reconstructs it from the quantized cache), and a block is committed to the quantized cache only once all of its tokens are accepted, so rejected draft tokens never corrupt history.
Note (weight quantization): KVarN quantizes the KV cache independently of the model weights, so it composes with weight-quantized checkpoints (for example compressed-tensors / AWQ INT4) and MTP at the same time. Validated on Qwen3.6-27B in both bfloat16 and AWQ INT4.
Note (weight quantization): KVarN quantizes the KV cache independently of the model weights, so it composes with weight-quantized checkpoints (for example compressed-tensors / AWQ INT4) and MTP at the same time. Validated on Qwen3.6-27B in both bfloat16 and AWQ INT4.
compressed-tensors
bfloat16
How does KVarN work?
KVarN quantizes the KV cache one fixed-size token tile at a time, walking each tile through the four stages above:
Cache: the raw fp16 KV tile (channels × tokens), straight from attention.
Cache: the raw fp16 KV tile (channels × tokens), straight from attention.
Rotated Cache: a Hadamard rotation along the channel dimension mixes channels so that per-channel outliers are spread out, making the tile easier to quantize. The rotation is orthonormal, so attention scores are preserved.
Rotated Cache: a Hadamard rotation along the channel dimension mixes channels so that per-channel outliers are spread out, making the tile easier to quantize. The rotation is orthonormal, so attention scores are preserved.
Normalized Cache: iterative variance normalization (Sinkhorn-like) alternates column- and row-wise standard-deviation normalization in log space, equalizing variance across the tile and shrinking quantization error before any rounding happens.
Normalized Cache: iterative variance normalization (Sinkhorn-like) alternates column- and row-wise standard-deviation normalization in log space, equalizing variance across the tile and shrinking quantization error before any rounding happens.
Quantized Cache: asymmetric round-to-nearest at low bit-width, with the scales folded back in at read time (keys per channel, values per token).
Quantized Cache: asymmetric round-to-nearest at low bit-width, with the scales folded back in at read time (keys per channel, values per token).
The shipped preset spends more bits on keys than values (kvarn_k4v2_g128: 4-bit keys, 2-bit values). We chose to release this configuration because it meets the strictest accuracy bar, matching FP16, that the most demanding production deployments and vLLM require, while still delivering throughput above FP16.
kvarn_k4v2_g128
Citation
KVarN is the official vLLM implementation of our paper:
@misc{muller2026kvarn, title={KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks}, author={Lorenz K. Muller and Philippe Bich and Chiara Boretti and Hyun-Min Chang and Jiawei Zhuang and Lukas Cavigelli}, year={2026}, eprint={2606.03458}, archivePrefix={arXiv}, primaryClass={cs.LG}, url={http://arxiv.org/abs/2606.03458} }
License and attribution
KVarN is built on vLLM (v0.22.0) and is released under the Apache 2.0 License. The original vLLM README is preserved as README_vLLM.md.
README_vLLM.md
🥊 Up to ~2.4× TurboQuant throughput, same capacity, higher accuracy.
Why KVarN (Variance Normalized KV-Cache)?
kvarn /kvɑːɳ/ · noun (Swedish) A grinding apparatus used to reduce substances into smaller particles or powder, especially grains, seeds, spices, coffee beans, KV-caches.
kvarn /kvɑːɳ/ · noun (Swedish)
A grinding apparatus used to reduce substances into smaller particles or powder, especially grains, seeds, spices, coffee beans, KV-caches.
KV-cache quantization usually comes with a catch. As the vLLM TurboQuant blog shows, existing methods buy extra KV-cache capacity but give up throughput (TurboQuant reports 40 to 52% lower throughput for 2.3-3.7x capacity), and aggressive low-bit quantization also tends to cost accuracy. Losing both speed and quality is the main reason KV-cache quantization is rarely turned on in production.
KVarN is built to keep both. On Qwen3-32B (AIME25, 16K-context burst, TP=2) it matches FP16 accuracy and beats its throughput while delivering ~4× the KV-cache capacity:
KVarN stays in the upper-right corner the blog's methods can't reach: FP16-level accuracy, FP16-or-better throughput, and several times the context.
Quickstart
KVarN ships as a vLLM fork. Install it like vLLM, then select the KVarN KV-cache dtype.
1. Clone git clone https://github.com/huawei-csl/KVarN.git cd KVarN # 2. Install (uses the upstream precompiled wheel; KVarN kernels are Triton, JIT-compiled at runtime) VLLM_USE_PRECOMPILED=1 pip install -e .
from vllm import LLM, SamplingParams llm = LLM( model="Qwen/Qwen3-32B", dtype="float16", # KVarN runs in float16 kv_cache_dtype="kvarn_k4v2_g128", # enable KVarN block_size=128, # KVarN tile size ) print(llm.generate("Explain KV-cache quantization in one sentence.", SamplingParams(max_tokens=64))[0].outputs[0].text)
Note: KVarN runs in float16 compute. One vLLM block is one KVarN tile, so the tile / page size equals --block-size. Both 128 (default) and 64 are supported, selected by the matching preset (kvarn_k4v2_g128 / kvarn_k4v2_g64). 128 is the design point; 64 gives finer quantization granularity at the cost of a little KV capacity (more per-tile scale overhead per token), at essentially the same throughput.
Note: KVarN runs in float16 compute. One vLLM block is one KVarN tile, so the tile / page size equals --block-size. Both 128 (default) and 64 are supported, selected by the matching preset (kvarn_k4v2_g128 / kvarn_k4v2_g64). 128 is the design point; 64 gives finer quantization granularity at the cost of a little KV capacity (more per-tile scale overhead per token), at essentially the same throughput.
float16
--block-size
kvarn_k4v2_g128
kvarn_k4v2_g64
Tip (capacity): KVarN realizes its full KV-cache capacity when there is room to amortize a small fixed decode workspace. On multi-GPU or generous --gpu-memory-utilization setups this is automatic. On a tight single-GPU budget, vLLM's CUDA-graph memory profiler can over-reserve and shrink the KV pool; set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 (and/or raise --gpu-memory-utilization) to recover the full capacity.
Tip (capacity): KVarN realizes its full KV-cache capacity when there is room to amortize a small fixed decode workspace. On multi-GPU or generous --gpu-memory-utilization setups this is automatic. On a tight single-GPU budget, vLLM's CUDA-graph memory profiler can over-reserve and shrink the KV pool; set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 (and/or raise --gpu-memory-utilization) to recover the full capacity.
--gpu-memory-utilization
VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0
--gpu-memory-utilization
MLA models
KVarN also supports Multi-head Latent Attention (MLA). To the best of our knowledge, KVarN is the first vLLM-compatible sub-8-bit KV-cache quantization method to support MLA-based models: it quantizes the compressed KV latent to int4. Use the same --kv-cache-dtype you use on dense models; on an MLA model it automatically routes to the MLA latent path (no code or env changes), and the fast path is on by default:
The win is 2.77× KV capacity at ~parity accuracy. MLA's latent is already tiny so KVarN is not a latency play there, but it lets you fit far more concurrent context in the same memory.
Hybrid models (Mamba / linear-attention)
KVarN supports hybrid models that interleave linear-attention or Mamba layers with standard full-attention layers, such as Qwen3.6-27B. KVarN compresses only the full-attention layers (the layers that hold a KV cache); the linear-attention and Mamba layers keep their own recurrent state and are left untouched. The fp16 decode pool is sized from the full-attention layer count only, so a hybrid model loads with the default flags, no manual pool tuning.
The KV-cache capacity gain applies to the full-attention layers (where the cache lives), so a hybrid model fits far more concurrent context in the same memory at near parity accuracy.
Speculative decoding (MTP and draft models)
KVarN is compatible with speculative decoding, including Multi-Token Prediction (MTP). Pass --speculative-config exactly as you normally would, alongside the KVarN --kv-cache-dtype:
The speculative verify step attends over the full cached context (KVarN reconstructs it from the quantized cache), and a block is committed to the quantized cache only once all of its tokens are accepted, so rejected draft tokens never corrupt history.
Note (weight quantization): KVarN quantizes the KV cache independently of the model weights, so it composes with weight-quantized checkpoints (for example compressed-tensors / AWQ INT4) and MTP at the same time. Validated on Qwen3.6-27B in both bfloat16 and AWQ INT4.
Note (weight quantization): KVarN quantizes the KV cache independently of the model weights, so it composes with weight-quantized checkpoints (for example compressed-tensors / AWQ INT4) and MTP at the same time. Validated on Qwen3.6-27B in both bfloat16 and AWQ INT4.
compressed-tensors
bfloat16
How does KVarN work?
KVarN quantizes the KV cache one fixed-size token tile at a time, walking each tile through the four stages above:
Cache: the raw fp16 KV tile (channels × tokens), straight from attention.
Cache: the raw fp16 KV tile (channels × tokens), straight from attention.
Rotated Cache: a Hadamard rotation along the channel dimension mixes channels so that per-channel outliers are spread out, making the tile easier to quantize. The rotation is orthonormal, so attention scores are preserved.
Rotated Cache: a Hadamard rotation along the channel dimension mixes channels so that per-channel outliers are spread out, making the tile easier to quantize. The rotation is orthonormal, so attention scores are preserved.
Normalized Cache: iterative variance normalization (Sinkhorn-like) alternates column- and row-wise standard-deviation normalization in log space, equalizing variance across the tile and shrinking quantization error before any rounding happens.
Normalized Cache: iterative variance normalization (Sinkhorn-like) alternates column- and row-wise standard-deviation normalization in log space, equalizing variance across the tile and shrinking quantization error before any rounding happens.
Quantized Cache: asymmetric round-to-nearest at low bit-width, with the scales folded back in at read time (keys per channel, values per token).
Quantized Cache: asymmetric round-to-nearest at low bit-width, with the scales folded back in at read time (keys per channel, values per token).
The shipped preset spends more bits on keys than values (kvarn_k4v2_g128: 4-bit keys, 2-bit values). We chose to release this configuration because it meets the strictest accuracy bar, matching FP16, that the most demanding production deployments and vLLM require, while still delivering throughput above FP16.
kvarn_k4v2_g128
Citation
KVarN is the official vLLM implementation of our paper:
@misc{muller2026kvarn, title={KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks}, author={Lorenz K. Muller and Philippe Bich and Chiara Boretti and Hyun-Min Chang and Jiawei Zhuang and Lukas Cavigelli}, year={2026}, eprint={2606.03458}, archivePrefix={arXiv}, primaryClass={cs.LG}, url={http://arxiv.org/abs/2606.03458} }
License and attribution
KVarN is built on vLLM (v0.22.0) and is released under the Apache 2.0 License. The original vLLM README is preserved as README_vLLM.md.