# End-to-End Context Compression at Scale

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-08 08:00
- AIHOT score: 67
- AIHOT link: https://aihot.virxact.com/items/cmq63xg0l04wwsl5iosivzf44
- Original link: https://arxiv.org/abs/2606.09659

## AI Summary

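Long-context language model inference is constrained by the memory footprint of the KV cache. Existing compression methods either degrade quality substantially or cost considerable time and compute. Guided by an architecture search in which many variants are pre-trained from scratch, the authors continually pre-train a family of models with a 0.6B encoder and a 4B decoder on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16, and name the family Latent Context Language Models (LCLMs). The family improves the Pareto frontier across general-task performance, compression speed, and peak memory usage, and can serve as an efficient backbone for long-horizon agents, which skim the compressed long context and expand relevant segments on demand.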

## Main Text

Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time and compute to compress a single long prompt. Furthermore, many methods require the input to fit within the target model's context window, and are generally incompatible with modern production inference engines. Encoder-decoder compressors, which map a long token sequence to a shorter sequence of latent embeddings consumed by a decoder, are an appealing alternative in principle. However, existing approaches are not competitive with KV cache compression on the accuracy-efficiency frontier. In this work, we revisit encoder-decoder compression and close this gap. We first perform an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors. Guided by our findings, we continually pre-train a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16. We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage. We demonstrate that LCLMs serve as efficient backbones for long-horizon agents, letting the agent skim through a compressed long context and adaptively expand relevant segments on demand.
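To make the memory argument concrete, the following is a rough back-of-the-envelope sketch of how a decoder's KV cache scales with the number of cached positions, and how that changes if the decoder caches latent embeddings instead of raw tokens. The decoder dimensions in `DecoderShape` (layer count, KV heads, head size, 16-bit values) are hypothetical placeholders, not the configuration used in the paper, and the sketch assumes that a 1:r compression ratio means the decoder sees roughly 1/r as many positions. It ignores encoder memory, activations, and any uncompressed tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderShape:
    # Hypothetical decoder dimensions, chosen only for illustration;
    # the abstract does not state the real configuration.
    num_layers: int = 36
    num_kv_heads: int = 8
    head_dim: int = 128
    bytes_per_value: int = 2  # assumes 16-bit cache entries


def kv_cache_bytes(num_positions: int, shape: DecoderShape) -> int:
    """Bytes needed to cache keys and values for `num_positions` positions."""
    # Factor of 2: one key vector and one value vector per layer and KV head.
    per_position = (
        2 * shape.num_layers * shape.num_kv_heads
        * shape.head_dim * shape.bytes_per_value
    )
    return num_positions * per_position


def latent_length(num_tokens: int, ratio: int) -> int:
    """Latent positions at a 1:`ratio` compression ratio, rounded up.

    Assumption: the encoder emits about one latent embedding per `ratio`
    input tokens.
    """
    return -(-num_tokens // ratio)  # ceiling division on integers


if __name__ == "__main__":
    shape = DecoderShape()
    context_tokens = 131_072
    gib = 1024 ** 3
    print(f"uncompressed: {kv_cache_bytes(context_tokens, shape) / gib:.3f} GiB")
    for ratio in (4, 8, 16):
        positions = latent_length(context_tokens, ratio)
        size = kv_cache_bytes(positions, shape) / gib
        print(f"1:{ratio:<2} -> {positions} positions, {size:.3f} GiB")
```

Under these placeholder dimensions, each cached position costs 144 KiB, so a 131,072-token context needs 18 GiB of KV cache, while the same context at 1:16 needs 1.125 GiB. The linear dependence on position count is the point of the sketch; the absolute numbers depend entirely on the assumed shape.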