德克萨斯大学论文指出,AI 智能体在部署后即使模型不变,也会因长期记忆的摘要压缩、相似记忆混淆、事实更新失效及维护操作而可靠性下降。例如药物剂量可能变成“每日用药”,相似客户记录混淆,已取消订阅仍保留,日程可能因维护消失。论文提出 AgingBench 基准测试,评估智能体在多次会话中的可靠性。研究强调“增加更多记忆”往往是错误修复——问题可能在于从未写入、写入后被挤掉、或写入后未被信任使用。论文将部署智能体重新定义为类似老化基础设施的系统。
Univ of Texas paper shows AI agents can slowly become less reliable after deployment, even when the model itself does not change.
The problem is that agents are often judged when they are fresh, but real agents keep changing because they summarize old chats, store more memories, update facts, and go through maintenance.
An agent that remembers you across weeks is really a small operating system wrapped around a language model: it writes notes, compresses them, retrieves them, updates them, and occasionally cleans house.
Every one of those steps can quietly rot.
A medication dose can become "a daily medication," two similar clients can blur into one, a canceled subscription can remain active, and a schedule can vanish after a maintenance pass.