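Researchers at the University of Illinois, Tsinghua University, and other institutions find that LLM agents can learn from experience, but the memory-rewriting mechanism that uses an LLM to compress raw experience into written lessons undermines memory reliability. In tests on web shopping, simulated worlds, and ARC-style puzzles, repeatedly rewriting memory led to bad grouping, overgeneralized rules, or overfitting, so agents forgot details or confused task types. For example, GPT-5.4 solved 100% of a small ARC-AGI problem set with no memory, but once memory was built and updated in a streaming fashion, performance fell to about 54%. The study argues that agent memory systems should treat raw experience as key evidence rather than automatically rewriting every experience into a summary; keeping raw evidence and summarizing selectively works better.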
A new study from the University of Illinois, Tsinghua University, and other labs finds that LLM agents still have unreliable memory, and that it can get worse when they keep rewriting their own memories.
LLM agents can learn from experience, but their rewritten memories often become unreliable.
The problem is that many agent systems store past work by asking an LLM to compress messy experience into neat written lessons.
That sounds useful because the agent should remember what worked before, but the paper finds that repeated rewriting slowly damages the memory.
The core idea is that raw episodes, meaning the actual past attempts and solutions, often stay more useful than the polished lessons made from them.
The authors tested this across tasks like web shopping, simulated worlds, app use, and ARC-style puzzle problems where they could control the correct solutions.
The sharpest result is that GPT-5.4 solved 100% of a small ARC-AGI set with no memory, but after memory was built from correct solutions, streaming updates dropped it to about 54%.
The failures came from bad grouping, overbroad lessons, and overfitting, so the memory forgot details, mixed up task types, or learned rules that only worked on narrow examples.
The big deal is that agent memory should not automatically rewrite every experience into a summary, because keeping raw evidence and only sometimes making summaries worked better.
The paper is really proposing that agent memory should treat raw past episodes as important evidence, not as disposable notes to summarize away.
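To make the "keep raw evidence, summarize only sometimes" idea concrete, here is a minimal sketch. It is not the paper's implementation: the names `Episode`, `RawFirstMemory`, and `maybe_summarize`, the word-overlap retrieval, and the `every` interval are all assumptions made up for illustration, and the summarizer is a plain callable standing in for an LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Episode:
    task: str       # task description as the agent saw it
    solution: str   # the raw attempt or solution, stored verbatim


@dataclass
class RawFirstMemory:
    """Hypothetical store where raw episodes are the source of truth."""
    episodes: list[Episode] = field(default_factory=list)
    summaries: list[str] = field(default_factory=list)

    def add(self, episode: Episode) -> None:
        # Append only: existing episodes are never rewritten or merged.
        self.episodes.append(episode)

    def retrieve(self, query: str, k: int = 2) -> list[Episode]:
        # Toy relevance score (an assumption): count of shared lowercase words.
        query_words = set(query.lower().split())

        def overlap(ep: Episode) -> int:
            return len(query_words & set(ep.task.lower().split()))

        return sorted(self.episodes, key=overlap, reverse=True)[:k]

    def maybe_summarize(
        self, summarizer: Callable[[list[Episode]], str], every: int = 10
    ) -> Optional[str]:
        # Summaries are written only occasionally and stored alongside
        # the raw episodes; they never replace them.
        if self.episodes and len(self.episodes) % every == 0:
            summary = summarizer(self.episodes[-every:])
            self.summaries.append(summary)
            return summary
        return None
```

The design choice the sketch tries to show is that `add` never touches older entries, so a bad summary can be discarded later without losing the evidence it was built from.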
----
Paper Link - arxiv.org/abs/2605.12978
Paper Title: "Useful Memories Become Faulty When Continuously Updated by LLMs"