面向推理模型的价值感知随机KV缓存淘汰策略
阅读原文· arxiv.org推理模型通过延长思考链提高准确率,但长输出导致内存与计算瓶颈。现有KV缓存淘汰方法因准确率常不及保留完整缓存的稀疏注意力方法而受限。研究发现,淘汰少量大数值价值状态会导致模型陷入重复推理循环;引入随机性则能提升缓存多样性以改善准确率。基于此,本文提出无需训练的“价值感知随机KV缓存淘汰”方案。在Qwen3模型上的实验表明,该方法进行4倍缓存压缩时,在六个推理任务上的平均准确率高于同等稀疏度下的SOTA选择方法,并比最强淘汰方法提升超过4%。
Reasoning models improve accuracy through extended chains of thought, but their long outputs create a memory and compute bottleneck. KV cache eviction methods reduce this cost by evicting unimportant key-value pairs from the cache, yet they often yield worse accuracy than selection-based sparse attention alternatives, which keep the full KV cache. We identify key factors crucial to KV cache eviction accuracy. First, a small fraction of value states have abnormally large magnitudes, and evicting them causes catastrophic failure where models enter repetitive reasoning loops. Second, introducing stochasticity during eviction improves accuracy by increasing cache diversity. Based on these findings, we propose Value-aware Stochastic KV Cache Eviction (VaSE), a training-free recipe that protects large-magnitude value states and promotes diverse eviction decisions. Across six reasoning tasks, Qwen3 models using VaSE with 4x KV cache compression yield higher average accuracies than SOTA selection method at the same sparsity, while outperforming the strongest eviction method by more than 4%. Overall, VaSE bridges the gap between efficiency and accuracy, supporting FlashAttention2 and enabling a static memory footprint for reasoning models.