KV Packet: Recomputation-Free, Context-Independent KV Caching for LLMs

2026-04-14 08:00

AI Summary

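The researchers propose KV Packet, a framework that wraps each cached document in lightweight trainable soft-token adapters to form an immutable "packet", enabling recomputation-free, context-independent reuse of KV caches. The adapters are trained via self-supervised distillation to bridge context discontinuities. In experiments on Llama-3.1 and Qwen2.5, the method incurs near-zero computational overhead (FLOPs) and achieves lower Time-to-First-Token (TTFT) than partial-recomputation baselines such as CacheBlend and EPIC, while its F1 scores remain comparable to those of full recomputation.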

Original Abstract

Large Language Models (LLMs) rely heavily on Key-Value (KV) caching to minimize inference latency. However, standard KV caches are context-dependent: reusing a cached document in a new context requires recomputing KV states to account for shifts in attention distribution. Existing solutions such as CacheBlend, EPIC, and SAM-KV mitigate this issue by selectively recomputing a subset of tokens; however, they still incur non-negligible computational overhead (FLOPs) and increased Time-to-First-Token (TTFT) latency. In this paper, we propose KV Packet, a recomputation-free cache reuse framework that treats cached documents as immutable "packets" wrapped in lightweight trainable soft-token adapters, which are trained via self-supervised distillation to bridge context discontinuities. Experiments on Llama-3.1 and Qwen2.5 demonstrate that the proposed KV Packet method achieves near-zero FLOPs and lower TTFT than recomputation-based baselines, while retaining F1 scores comparable to those of the full recomputation baseline.
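
The reuse pattern the abstract describes can be illustrated with a toy sketch. This is not the paper's implementation: the `Packet` class, the placement of soft tokens as a head and a tail around the document, and the helper names `make_packet`, `assemble`, and `attend` are all hypothetical, and the adapter entries are random placeholders standing in for trained ones. It models a single attention head in a single layer with no positional encoding, and only shows that cached entries are concatenated as-is, without recomputing any of them.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vector = List[float]
KV = List[Tuple[Vector, Vector]]  # one (key, value) pair per token


@dataclass(frozen=True)
class Packet:
    """Hypothetical container: cached document KV wrapped in adapter KV."""
    head: KV  # soft-token adapter entries placed before the document (assumed layout)
    doc: KV   # document entries, computed once in isolation and never updated
    tail: KV  # soft-token adapter entries placed after the document (assumed layout)

    def entries(self) -> KV:
        return [*self.head, *self.doc, *self.tail]


def random_kv(rng: random.Random, n_tokens: int, dim: int) -> KV:
    """Random stand-in for KV states that a real model would produce."""
    return [
        ([rng.gauss(0, 1) for _ in range(dim)], [rng.gauss(0, 1) for _ in range(dim)])
        for _ in range(n_tokens)
    ]


def make_packet(rng: random.Random, n_doc: int, dim: int, n_soft: int = 2) -> Packet:
    # In the paper the adapters are trained; here they are random placeholders.
    return Packet(
        head=random_kv(rng, n_soft, dim),
        doc=random_kv(rng, n_doc, dim),
        tail=random_kv(rng, n_soft, dim),
    )


def assemble(packets: Sequence[Packet]) -> KV:
    """Build a context cache by concatenation only; no entry is recomputed."""
    return [kv for packet in packets for kv in packet.entries()]


def attend(query: Vector, cache: KV) -> Tuple[Vector, List[float]]:
    """Scaled dot-product attention of one query over the assembled cache."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key, _ in cache]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(cache[0][1])
    output = [sum(w * value[i] for w, (_, value) in zip(weights, cache)) for i in range(dim)]
    return output, weights


rng = random.Random(0)
doc_a = make_packet(rng, n_doc=5, dim=4)
doc_b = make_packet(rng, n_doc=3, dim=4)

# The same packets are reused in two different contexts without modification.
cache_ab = assemble([doc_a, doc_b])
cache_ba = assemble([doc_b, doc_a])

query = [rng.gauss(0, 1) for _ in range(4)]
output, weights = attend(query, cache_ab)
```

A real system would hold per-layer, per-head tensors and would have to handle positional information, which this sketch leaves out entirely.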

Source: HuggingFace Daily Papers (community trending papers)


Read the original: arxiv.org
