# OSCAR: Offline Spectrum-Aware Covariance Rotation for 2-Bit KV-Cache Quantization

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-18 08:00
- AIHOT score: 68
- AIHOT link: https://aihot.virxact.com/items/cmpd29t5j00b8slk1r58bkvx5
- Original link: https://arxiv.org/abs/2605.17757

## AI Summary

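This paper proposes OSCAR to address the accuracy loss of INT2 KV-cache quantization in long-context LLM serving. Its core idea is to estimate offline the covariance structure that attention actually uses, and to derive from it fixed rotation matrices and clipping thresholds, so that KV-cache quantization is aligned with the downstream attention computation. Experiments show that OSCAR substantially improves quantized accuracy: on Qwen3-4B and Qwen3-8B, the gap to BF16 narrows to 3.78 and 1.42 points respectively, whereas naive rotation nearly collapses. The method remains robust on larger models and in long-context tests up to 128K. At the system level, OSCAR reduces KV-cache memory usage by roughly 8x and raises large-batch throughput by up to 7x.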
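The roughly 8x memory figure is consistent with simple bit-width arithmetic: BF16 stores 16 bits per cached element and INT2 stores 2. The sketch below works through that ratio. The model shape in `cfg` is a hypothetical example rather than a configuration taken from the paper, and the helper `kv_cache_bytes` is an illustrative name. It also ignores any per-group scales or other quantization metadata, which in a real system would make the saving somewhat less than exactly 8x.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch, bits):
    """Payload size of a KV cache in bytes (keys + values, no metadata)."""
    # Factor of 2 accounts for storing both keys and values.
    elems = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch
    return elems * bits // 8


# Hypothetical model shape, chosen only to make the arithmetic concrete.
cfg = dict(num_layers=36, num_kv_heads=8, head_dim=128, seq_len=32768, batch=1)

bf16_bytes = kv_cache_bytes(bits=16, **cfg)
int2_bytes = kv_cache_bytes(bits=2, **cfg)
ratio = bf16_bytes / int2_bytes

print(f"BF16: {bf16_bytes / 2**30:.2f} GiB")
print(f"INT2: {int2_bytes / 2**30:.2f} GiB")
print(f"ratio: {ratio:.1f}x")
```

Because the ratio depends only on the bit widths, it holds for any model shape. Under a fixed memory budget, the freed space can hold more concurrent sequences, which is one plausible reason a throughput gain shows up at large batch sizes.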

## Full Text

INT2 KV-cache quantization is attractive for long-context LLM serving, but it remains difficult to make both accurate and deployable. Simple rotations such as Hadamard transforms reduce outliers, but they still degrade at INT2 because they are not aligned with downstream attention. We propose OSCAR, an ultra-low-bit KV-cache quantization method that estimates attention-aware covariance structures offline and uses them to derive fixed rotations and clipping thresholds for quantization. In this way, it aligns KV quantization with the covariance structures that attention actually consumes. Beyond providing theoretical justification, we develop a fully deployable OSCAR system with a custom INT2 attention kernel that remains compatible with paged KV-cache serving and fused kernel pipelines, enabling seamless integration into modern LLM serving frameworks such as SGLang and vLLM. We evaluate our method on recent reasoning models with reasoning traces of up to 32k tokens across 5 tasks. On Qwen3-4B-Thinking-2507 and Qwen3-8B, OSCAR reduces the accuracy gap to BF16 to 3.78 and 1.42 points, respectively, while naive-rotation INT2 collapses to nearly zero. We further scale OSCAR to Qwen3-32B and GLM-4.7 (358B parameters), where it remains effectively on par with BF16. On the long-context RULER-NIAH benchmark at up to 128K tokens, OSCAR remains robust on both Qwen3 models, while naive-rotation INT2 collapses. System-wise, OSCAR reduces KV-cache memory by approximately 8x, improves throughput by up to 7x at large batch sizes under the same memory budget, and accelerates batch-size-1 decoding by up to 3x over BF16 due to reduced memory bandwidth overhead.
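The pipeline described above (an offline calibration pass that fixes a rotation and clipping thresholds, followed by INT2 quantization at serving time) can be illustrated with a small NumPy sketch. This is not the authors' algorithm: the abstract does not say which tensors the covariance is estimated from or how the rotation is derived from it. As a stand-in assumption, the sketch takes the eigenvectors of a calibration second-moment matrix as the rotation and a per-channel quantile as the clip threshold. The names `fit_rotation_and_clip`, `quantize_int2`, `dequantize_int2`, and `clip_quantile` are all hypothetical.

```python
import numpy as np


def fit_rotation_and_clip(calib, clip_quantile=0.99):
    """Offline step (assumed form): fixed orthogonal rotation + clip thresholds."""
    # Second-moment estimate over calibration vectors, shape (d, d).
    cov = calib.T @ calib / calib.shape[0]
    # eigh returns orthonormal eigenvectors as columns for a symmetric matrix.
    _, rotation = np.linalg.eigh(cov)
    rotated = calib @ rotation
    # Per-channel clipping threshold taken from the calibration distribution.
    clip = np.quantile(np.abs(rotated), clip_quantile, axis=0)
    return rotation, clip


def quantize_int2(x, rotation, clip):
    """Rotate, clip, and map every value to one of 4 levels (codes 0..3)."""
    z = np.clip(x @ rotation, -clip, clip)
    scale = 2.0 * clip / 3.0  # 4 uniform levels span [-clip, clip]
    codes = np.rint((z + clip) / scale).astype(np.uint8)
    return codes, scale


def dequantize_int2(codes, rotation, clip, scale):
    """Invert the level mapping, then undo the rotation."""
    z_hat = codes.astype(np.float64) * scale - clip
    return z_hat @ rotation.T


# Synthetic calibration data with uneven per-channel scales.
rng = np.random.default_rng(0)
calib = rng.normal(size=(512, 8)) * np.linspace(0.5, 4.0, 8)

rotation, clip = fit_rotation_and_clip(calib)
codes, scale = quantize_int2(calib[:4], rotation, clip)
recon = dequantize_int2(codes, rotation, clip, scale)
print(codes)
print(np.abs(recon - calib[:4]).mean())
```

Since the rotation and thresholds are computed once offline, the serving path only needs a fixed matrix multiply, a clip, and a rounding step. A production kernel would additionally pack four 2-bit codes into each byte, which this sketch leaves out for clarity.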
