# This paper shows how large language models can cut costs by using shorter context while preserving answer quality.

- Source: Rohan Paul (@rohanpaul_ai)
- Published: 2026-05-29 17:57
- AIHOT score: 57
- AIHOT link: https://aihot.virxact.com/items/cmpqrn1r107ncslnodwlf2yjj
- Original link: https://x.com/rohanpaul_ai/status/2060299358700978267

## AI Summary

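The paper proposes an "Efficiency Frontier" framework for jointly evaluating the cost-performance trade-off of LLM context management strategies. The core finding is that choosing the right context method at deployment time can reduce token usage by about 25%, and in some reused-memory scenarios can cut cost by over 50%, with only a small loss in answer quality. The study notes that context length has diminishing returns: tokens added later cost a lot but yield little. In tests on 5,000 HotpotQA questions, lightweight retrieval suits low reuse rates, memory compression is better at high reuse rates, and full-context prompting is still needed for the highest performance.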

## Body

This paper shows how LLMs can use shorter context more cheaply without losing much answer quality.

It shows that choosing the right context method for the deployment setting can cut token use by about 25% at similar quality, and by over 50% in some reused-memory cases.

The problem is that long context gives a model more information, but every extra token costs money and compute, and the extra context often brings smaller gains.

Longer context has diminishing returns, and the expensive tokens are often the ones added after the model already has enough signal.

The authors propose an Efficiency Frontier, which compares context strategies by looking at answer quality and token cost together instead of treating them as separate scores.
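One way to picture a joint cost-quality comparison is as a Pareto front: a strategy stays on the frontier only if no other strategy is both at least as cheap and at least as good. The sketch below is an illustration under that assumption, not the paper's actual definition. The `Strategy` type, the option names, and all numbers are hypothetical.

```python
from typing import NamedTuple


class Strategy(NamedTuple):
    name: str
    tokens: float   # average tokens per question (made-up value)
    quality: float  # answer quality score, higher is better (made-up value)


def efficiency_frontier(strategies):
    """Return the non-dominated strategies, sorted by token cost.

    A strategy is dominated when another one costs no more tokens,
    scores no lower, and is strictly better on at least one of the two.
    """
    frontier = []
    for s in strategies:
        dominated = any(
            o.tokens <= s.tokens
            and o.quality >= s.quality
            and (o.tokens < s.tokens or o.quality > s.quality)
            for o in strategies
        )
        if not dominated:
            frontier.append(s)
    return sorted(frontier, key=lambda s: s.tokens)


# Hypothetical candidates; "bloated" costs more than "light" yet scores lower.
candidates = [
    Strategy("light", 1500, 0.62),
    Strategy("bloated", 2500, 0.60),
    Strategy("medium", 4000, 0.66),
    Strategy("full", 8000, 0.71),
]

frontier_names = [s.name for s in efficiency_frontier(candidates)]
print(frontier_names)
```

Under these invented numbers, only the dominated option drops out. The remaining points each buy extra quality at extra token cost, which is the trade-off a deployment has to pick from.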

The key idea is that some methods are cheap per question, like retrieval, while others spend more upfront, like memory compression, but become cheaper when the same processed context is reused many times.
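The amortization argument can be made concrete with a little arithmetic. The sketch below assumes a simple cost model in which a one-time upfront token cost is spread evenly over every question that reuses the processed context. This model, the function names, and the token counts are illustrative assumptions and do not come from the paper.

```python
def cost_per_question(upfront_tokens, per_query_tokens, reuse_count):
    """Amortized token cost per question under a simple assumed model:
    the upfront cost is shared equally by all questions reusing the context."""
    if reuse_count < 1:
        raise ValueError("reuse_count must be at least 1")
    return upfront_tokens / reuse_count + per_query_tokens


# Hypothetical token budgets, chosen only to illustrate the crossover.
RETRIEVAL = {"upfront_tokens": 0, "per_query_tokens": 1200}
COMPRESSION = {"upfront_tokens": 6000, "per_query_tokens": 400}


def cheaper_strategy(reuse_count):
    """Name the cheaper option at a given reuse count (ties go to retrieval)."""
    retrieval = cost_per_question(reuse_count=reuse_count, **RETRIEVAL)
    compression = cost_per_question(reuse_count=reuse_count, **COMPRESSION)
    return "retrieval" if retrieval <= compression else "compression"


for n in (1, 7, 8, 50):
    print(n, cheaper_strategy(n))
```

With these made-up numbers the break-even point is 6000 / (1200 - 400) = 7.5 reuses, so retrieval wins up to 7 reuses and compression wins from 8 onward.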

They tested this on 5,000 HotpotQA questions, where the model has to combine facts across documents while ignoring distracting text.

The main result is that the best context strategy changes with the setting: lightweight retrieval works best when reuse is low, memory compression becomes better when reuse is high, and full-context prompting is still needed for the highest scores.

----

Link - arxiv.org/abs/2605.23071

Title: "The Efficiency Frontier: A Unified Framework for Cost-Performance Optimization in LLM Context Management"