# Plans Do Not Persist: Why Context Management Is Critical for LLM Agents

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-22 08:00
- AIHOT score: 65
- AIHOT link: https://aihot.virxact.com/items/cmqtrgr9r067asl0edflkmi9x
- Original link: https://arxiv.org/abs/2606.22953

## AI Summary

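The study finds that standard LLM agents rely on the context window to hold plan information rather than internalizing it as persistent state. On Llama-3.1-70B, the plan signal peaks at 0.453 one step after the plan is written, then falls 4.1x within a single action-observation step; on HotpotQA it falls 12.4x. For reasoning models (DeepSeek-R1-Distill-Llama-70B), chain-of-thought traces re-derive the plan; strict stripping recovers +163% of the signal in-sample and +153% held out, versus only +4.8% for the non-reasoning model. A probe trained on Llama transfers to R1 at AUROC 0.748, while R1-specific probes reach 1.000. In the compression stress test, evicting the plan lowers ALFWorld success by 34.7 percentage points. The framework shows that agent-critical information can be context-resident rather than persistent.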

## Full Text

Long-horizon agents depend on context management: systems compress, summarize, and evict old tokens so tasks can continue beyond finite windows. That is safe only when dropped information is no longer needed or has been internalized. Plans are the stress case: they are written early, used for many steps, and are the first to be evicted. We introduce replay pairing, a diagnostic that runs the same trajectory with and without the plan in history and measures hidden-state cosine distance. On Llama-3.1-70B, plan signal spikes to 0.453 one step after the plan, then falls 4.1x in a single action-observation step; on HotpotQA it falls 12.4x. This is evidence that standard LLM agents do not carry plans forward as persistent state, and instead depend on the plan remaining in context. A layer-L32 probe detects this decay; we treat it as a diagnostic, not as proof that the probe reads plan content itself. Reasoning models add a measurement confound: their `<think>` traces re-derive plan content, so standard stripping leaves plan evidence in the stripped condition. We name this the reasoning-trace confound and fix it with strict stripping, which removes prior `<think>` blocks from the stripped run only. It recovers +163% of the step+1 signal in-sample and +153% held out, while not meaningfully changing non-reasoning Llama (+4.8%). On DeepSeek-R1-Distill-Llama-70B, a Llama-trained probe transfers at AUROC 0.748 (p=6e-4), while R1-specific probes reach 1.000, suggesting R1 encodes plan signal in a different hidden-state direction. Finally, a compression stress test shows the practical cost: naive plan eviction cuts ALFWorld success by 34.7pp, while probe-gated re-surfacing does not recover it. The contribution is a measurement and stress-test framework showing that agent-critical information can be context-resident rather than persistent. Context management is load-bearing, but plan protection alone is not enough.
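The replay-pairing measurement described above can be illustrated with a minimal sketch. It assumes the hidden states for the two paired runs (plan in history, plan removed) have already been extracted as arrays of shape `(steps, hidden_dim)`. The function names `cosine_distance`, `plan_signal`, and `decay_ratio` are hypothetical and are not taken from the paper; the actual extraction layer, token position, and aggregation may differ.

```python
import numpy as np


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - float(np.dot(a, b) / denom)


def plan_signal(hidden_with_plan, hidden_without_plan):
    """Per-step cosine distance between the two paired runs.

    Both inputs are assumed to have shape (steps, hidden_dim), with row t
    holding the hidden state at step t of the same trajectory.
    """
    with_plan = np.asarray(hidden_with_plan, dtype=np.float64)
    without_plan = np.asarray(hidden_without_plan, dtype=np.float64)
    if with_plan.shape != without_plan.shape:
        raise ValueError("paired runs must have identical shapes")
    return np.array(
        [cosine_distance(u, v) for u, v in zip(with_plan, without_plan)]
    )


def decay_ratio(signal, step):
    """How many times the signal shrinks between `step` and `step + 1`."""
    return float(signal[step] / signal[step + 1])
```

Under these assumptions, a large value at the step right after the plan followed by a high `decay_ratio` would correspond to the pattern the abstract reports: the two runs diverge while the plan is freshly in context and converge again shortly after.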
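The difference between standard and strict stripping can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the history is assumed to be a list of `{"role", "content"}` dictionaries, the plan is assumed to occupy a single message at a known index, and reasoning traces are assumed to be delimited by `<think>...</think>` tags. The names `strip_plan` and `strict_strip` are hypothetical.

```python
import re

# Assumption: reasoning traces are wrapped in <think>...</think> tags.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_plan(history, plan_index):
    """Standard stripping: drop only the plan message from the history."""
    return [msg for i, msg in enumerate(history) if i != plan_index]


def strict_strip(history, plan_index):
    """Strict stripping: drop the plan and remove prior reasoning traces.

    Returns a new list; the input history is left unmodified so the
    with-plan run of the pair can still use it as-is.
    """
    stripped = []
    for i, msg in enumerate(history):
        if i == plan_index:
            continue
        if msg["role"] == "assistant":
            msg = {**msg, "content": THINK_BLOCK.sub("", msg["content"])}
        stripped.append(msg)
    return stripped
```

With standard stripping, an earlier assistant turn may still contain a trace that restates the plan, so the stripped run is not truly plan-free. Applying `strict_strip` to the stripped run only is meant to close that leak while leaving the with-plan run untouched.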