KVarN: a native vLLM backend for KV-cache quantization, developed by Huawei

2026-06-05 06:04 · 28 days ago · theanonymousone

AI summary

Huawei has released KVarN, a native vLLM backend dedicated to key-value cache (KV-cache) quantization. The project is public on GitHub and has received 100 points on Hacker News.

Original text · untranslated

⚡️ Built for agentic and long-context workloads.

💡 KVarN delivers 3-5x more KV-cache capacity and up to ~1.3x the throughput of FP16, so you fit far longer contexts and serve more concurrent requests, with FP16-level accuracy.

🔌 Calibration-free, plug-and-play with vLLM. A native vLLM attention backend: add one flag, no model changes, no calibration.

🥊 Up to ~2.4× the throughput of TurboQuant at the same capacity, with higher accuracy.

Why KVarN (Variance Normalized KV-Cache)?

kvarn /kvɑːɳ/ · noun (Swedish)

A grinding apparatus used to reduce substances into smaller particles or powder, especially grains, seeds, spices, coffee beans, KV-caches.

KV-cache quantization usually comes with a catch. As the vLLM TurboQuant blog shows, existing methods buy extra KV-cache capacity but give up throughput (TurboQuant reports 40 to 52% lower throughput for 2.3-3.7x capacity), and aggressive low-bit quantization also tends to cost accuracy. Losing both speed and quality is the main reason KV-cache quantization is rarely turned on in production.
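To make the accuracy side of that trade-off concrete, here is a minimal sketch of generic group-wise min-max quantization in pure Python. It is not KVarN's variance-normalized scheme, which this excerpt does not describe; the function names, the group size of 128, and the Gaussian test data are all assumptions chosen for illustration.

```python
import random

def quantize_group(values, bits):
    """Asymmetric uniform quantization of one group; returns dequantized values."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1                      # e.g. 15 steps for 4 bits, 3 for 2 bits
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return [lo + c * scale for c in codes]

def mean_abs_error(values, bits, group_size=128):
    """Average reconstruction error when each group is quantized independently."""
    total = 0.0
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        deq = quantize_group(group, bits)
        total += sum(abs(a - b) for a, b in zip(group, deq))
    return total / len(values)

random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(1024)]  # stand-in for cached K or V values
err4 = mean_abs_error(x, bits=4)
err2 = mean_abs_error(x, bits=2)
print(f"4-bit error: {err4:.4f}, 2-bit error: {err2:.4f}")
```

Fewer bits mean coarser steps inside each group, so the 2-bit error comes out larger than the 4-bit error. That gap is what an accuracy-preserving method has to close.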

KVarN is built to keep both. On Qwen3-32B (AIME25, 16K-context burst, TP=2) it matches FP16 accuracy and beats its throughput while delivering ~4× the KV-cache capacity.

KVarN stays in the upper-right corner the blog's methods can't reach: FP16-level accuracy, FP16-or-better throughput, and several times the context.
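A rough back-of-the-envelope calculation shows how bit widths translate into capacity. The sketch below assumes that `k4v2_g128` means 4-bit keys, 2-bit values, and groups of 128 elements, that each group carries 32 bits of metadata (for example an FP16 scale and offset), and that Qwen3-32B has 64 layers, 8 KV heads, and a head dimension of 128. None of these details are confirmed by this excerpt, and KVarN's real storage layout may differ.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim,
                       k_bits, v_bits, group_size=None, meta_bits_per_group=0):
    """Bytes of KV cache needed for one token, summed over all layers."""
    elems = num_layers * num_kv_heads * head_dim   # element count for K (V has the same)
    # Per-element share of the per-group metadata (hypothetical layout).
    overhead = meta_bits_per_group / group_size if group_size else 0.0
    total_bits = elems * ((k_bits + overhead) + (v_bits + overhead))
    return total_bits / 8

# Assumed Qwen3-32B shape: 64 layers, 8 KV heads, head_dim 128.
fp16 = kv_bytes_per_token(64, 8, 128, k_bits=16, v_bits=16)
quant = kv_bytes_per_token(64, 8, 128, k_bits=4, v_bits=2,
                           group_size=128, meta_bits_per_group=32)
ratio = fp16 / quant
print(f"FP16: {fp16 / 1024:.0f} KiB/token, quantized: {quant / 1024:.0f} KiB/token, "
      f"ratio: {ratio:.2f}x")
```

Under these assumptions the ratio lands just under 5×, the upper end of the 3-5x range quoted earlier. The ~4× reported for Qwen3-32B suggests the real layout carries more overhead than this simple model accounts for.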

Quickstart

KVarN ships as a vLLM fork. Install it like vLLM, then select the KVarN KV-cache dtype.

    # 1. Clone
    git clone https://github.com/huawei-csl/KVarN.git
    cd KVarN

    # 2. Install (uses the upstream precompiled wheel; KVarN kernels are Triton, JIT-compiled at runtime)
    VLLM_USE_PRECOMPILED=1 pip install -e .

    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen3-32B",
        dtype="float16",                   # KVarN runs in float16
        kv_cache_dtype="kvarn_k4v2_g128",  # enable KVarN
        block_size=128,                    # KVarN tile size
    )
    print(llm.generate("Explain KV-cache quantization in one sentence.", SamplingParams(max_tokens=64))[0].outputs[0].text)
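When comparing against an FP16 baseline, it can help to keep the two configurations side by side. The helper below is a hypothetical convenience (the name `engine_kwargs` is not part of vLLM or KVarN). It only assembles the keyword arguments shown in the quickstart, so it runs without a GPU or vLLM installed.

```python
def engine_kwargs(model, use_kvarn):
    """Build LLM(...) keyword arguments for a baseline or KVarN run."""
    kwargs = {"model": model, "dtype": "float16"}
    if use_kvarn:
        # Values copied from the quickstart; no other settings are changed.
        kwargs.update(kv_cache_dtype="kvarn_k4v2_g128", block_size=128)
    return kwargs

baseline = engine_kwargs("Qwen/Qwen3-32B", use_kvarn=False)
quantized = engine_kwargs("Qwen/Qwen3-32B", use_kvarn=True)

# With the KVarN fork installed, either dict can be passed straight through:
#   from vllm import LLM
#   llm = LLM(**quantized)
```

Keeping everything except the KV-cache settings identical makes it easier to attribute any throughput or accuracy difference to the quantization itself.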
