CONF-KV: Confidence-Based KV Cache Eviction with Mixed-Precision Storage

2026-05-24 08:00 · 40 days ago

AI Summary

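CONF-KV is a KV-cache manager for long-sequence large language model inference. Its core idea is to convert the next-token prediction distribution into a scalar confidence score and use it to set the cache budget dynamically at each step: more context is retained when the model is uncertain, and the cache is pruned aggressively when it is confident. Cached tokens are ranked by a composite of accumulated attention mass and recency, and a protected recent window preserves local coherence. The method is combined with blockwise online-softmax attention, mixed FP16/INT8 storage, and pyramidal per-layer budget allocation. In experiments with generated lengths up to 4K, its memory footprint stays close to that of a fixed 512-token sliding window. On Needle-in-a-Haystack retrieval with contexts up to 32K tokens, CONF-KV reaches 91.4% accuracy, well above sliding windows (53.8%) and H2O (80.6%). On 75 VisualWebArena tasks, it retains 95.3% of the full-KV-cache success rate at 2.8 times lower peak memory.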

Original abstract · untranslated

Long-horizon LLM inference turns the key-value (KV) cache into the dominant GPU memory consumer and makes per-token attention increasingly expensive. Many common eviction policies use static recency windows or historical attention, leaving unused a signal computed on every decoding step: the model's current uncertainty. We introduce CONF-KV, a KV-cache manager that converts the next-token distribution into a scalar confidence score and uses it to choose the per-step cache budget, retaining more context when the model is uncertain and pruning aggressively when it is confident. Within each budget, tokens are ranked by a composite of accumulated attention mass and recency, while a protected recent window preserves local coherence. We combine the policy with blockwise online-softmax attention, mixed FP16/INT8 storage, and a pyramidal per-layer budget variant. Across four model families and generated lengths up to 4K, CONF-KV stays near the footprint of a fixed 512-token sliding window while remaining within 1.5 to 2.1 perplexity points of full KV. On Needle-in-a-Haystack up to 32K tokens, CONF-KV reaches 91.4% retrieval accuracy versus 53.8% for sliding windows and 80.6% for H2O; on 75 VisualWebArena tasks it retains 95.3% of full-KV success at 2.8 times lower peak memory.
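The abstract does not give the exact formulas, so the following is only a minimal sketch of how a confidence-driven budget and a composite eviction ranking could fit together. Everything specific here is an assumption made for illustration: confidence is taken as one minus the normalized entropy of the next-token distribution, the budget is a linear interpolation between hypothetical `min_budget` and `max_budget` values, and the composite score mixes attention mass and recency with a hypothetical weight `alpha`. The function names are likewise invented and are not the paper's API.

```python
import math


def confidence_from_logits(logits):
    """Assumed confidence: 1 - normalized entropy of softmax(logits), in [0, 1]."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    max_entropy = math.log(len(logits))
    return 1.0 - entropy / max_entropy if max_entropy > 0 else 1.0


def budget_from_confidence(conf, min_budget, max_budget):
    """Assumed linear map: uncertain (conf=0) -> max_budget, confident (conf=1) -> min_budget."""
    return round(max_budget - conf * (max_budget - min_budget))


def select_kept_positions(attn_mass, budget, recent_window, alpha=0.7):
    """Return sorted cache positions to keep.

    attn_mass[i] is the accumulated attention mass of cached token i.
    The last `recent_window` positions are always kept; remaining slots go to
    the older tokens with the highest composite score (alpha is hypothetical).
    """
    n = len(attn_mass)
    if n <= budget:
        return list(range(n))
    protected_start = max(0, n - recent_window)
    protected = list(range(protected_start, n))
    slots = max(0, budget - len(protected))
    total = sum(attn_mass[:protected_start]) or 1.0  # avoid division by zero

    def score(i):
        # Normalized attention mass plus a recency term that grows with position.
        return alpha * attn_mass[i] / total + (1.0 - alpha) * (i + 1) / n

    candidates = sorted(range(protected_start), key=score, reverse=True)
    return sorted(candidates[:slots] + protected)


# Usage: one decoding step with a toy 4-token vocabulary and 8 cached tokens.
step_logits = [2.0, 1.5, 0.3, 0.1]
conf = confidence_from_logits(step_logits)
step_budget = budget_from_confidence(conf, min_budget=4, max_budget=8)
kept = select_kept_positions(
    [0.9, 0.01, 0.5, 0.02, 0.03, 0.1, 0.1, 0.1], step_budget, recent_window=2
)
```

In this sketch the protected window takes priority over the budget, so if `recent_window` exceeds the budget the recent tokens are still all kept. Whether CONF-KV resolves that case the same way is not stated in the abstract.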

HuggingFace Daily Papers (community trending papers)

Read the original: arxiv.org