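The paper proposes an "Efficiency Frontier" framework for evaluating, in a unified way, the cost and performance trade-offs of LLM context-management strategies. The core finding is that choosing the right context method at deployment time can reduce token usage by about 25%, and can cut cost by more than 50% in some memory-reuse scenarios, with only a small loss in answer quality. The study notes that context length has diminishing returns: tokens added later are expensive but contribute little. In tests on 5,000 HotpotQA questions, lightweight retrieval suits low reuse rates, memory compression does better at high reuse rates, and full-context prompting is still required for the highest performance.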
This paper shows how LLM deployments can get by with shorter, cheaper context without losing much answer quality.
It shows that choosing the right context method for the deployment setting can cut token use by about 25% at similar quality, and by over 50% in some reused-memory cases.
The problem is that long context gives a model more information, but every extra token costs money and compute, and the extra context often brings smaller gains.
Longer context has diminishing returns, and the expensive tokens are often the ones added after the model already has enough signal.
The authors propose an Efficiency Frontier, which compares context strategies by looking at answer quality and token cost together instead of treating them as separate scores.
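One way to picture scoring quality and cost together is a Pareto frontier: keep only the strategies that no other strategy beats on both token cost and answer quality. The sketch below assumes that reading; the paper's exact definition of the Efficiency Frontier is not given here. The `Strategy` type, the `efficiency_frontier` function, and all numbers are hypothetical placeholders, not results from the paper.

```python
from typing import List, NamedTuple


class Strategy(NamedTuple):
    name: str
    tokens: float   # average tokens per query (made-up value)
    quality: float  # answer quality, higher is better (made-up value)


def efficiency_frontier(strategies: List[Strategy]) -> List[Strategy]:
    """Return the strategies that are not dominated on (lower tokens, higher quality)."""
    frontier = []
    best_quality = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, try higher quality first.
    for s in sorted(strategies, key=lambda s: (s.tokens, -s.quality)):
        # A costlier strategy earns its place only if it improves quality.
        if s.quality > best_quality:
            frontier.append(s)
            best_quality = s.quality
    return frontier


# Illustrative inputs only; these are not measurements from the paper.
candidates = [
    Strategy("light_retrieval", 800, 0.61),
    Strategy("memory_compression", 1200, 0.66),
    Strategy("naive_truncation", 1500, 0.60),
    Strategy("full_context", 4000, 0.72),
]

frontier_names = [s.name for s in efficiency_frontier(candidates)]
print(frontier_names)
```

With these placeholder values, `naive_truncation` drops out because `memory_compression` is both cheaper and better, while `full_context` stays on the frontier as the most expensive, highest-quality option.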