AgentCL:面向语言智能体持续学习的严格评估框架
阅读原文· arxiv.orgAgentCL 是一个评估语言智能体持续学习的框架,核心是构造受控任务流和转移增益指标。受控流确保早期子解、证据或工作流可在后续任务中复用,而朴素流无法保证复用。框架还引入 MemProbe 探测方法,存储交互、洞察与技能,并在整合时过滤不可靠经验。在编码、深度研究和语言理解/推理任务上的实验表明,朴素流难以区分不同记忆设计,受控流能清晰区别其可塑性;朴素流与保留设置往往增益有限,甚至暴露记忆诱导的性能退化。研究揭示了平衡可塑性与稳定复用的更强记忆设计需求。
Language agents spend substantial inference time solving individual tasks, yet the experience acquired in one episode is often underutilized in future episodes. Continual learning expects an agent to accumulate reusable experience across a stream of tasks, improve over time, and avoid interference from irrelevant experiences. Unfortunately, existing benchmarks struggle to evaluate continual learning in language agents rigorously. Most efforts focus on retrieval and reasoning over long-context conversations or documents, while recent lifelong-adaptation benchmarks often rely on naive task streams with limited analysis of cross-task relationships, making it difficult to understand what an agent learns and reuses over time. This paper presents an evaluation framework AgentCL for continual learning in agents, centered on controlled task streams and metrics for transfer gains. AgentCL constructs compositional streams where earlier sub-solutions, evidence, or workflows are intentionally reusable in later tasks, and contrasts them with naive streams where such reusability is not guaranteed. We use the benchmark to evaluate non-parametric memory designs for continual learning. To diagnose how memory design choices affect continual learning, we develop MemProbe, a probing method that stores interactions, insights, and skills, while filtering unreliable experiences during consolidation. Empirical analysis across coding, deep research, and language understanding/reasoning tasks shows that naive streams offer limited ability to distinguish memory designs, whereas controlled streams more clearly distinguish their plasticity. Meanwhile, naive and held-out settings often yield limited gains and can expose memory-induced degradation. These results highlight the need for stronger memory designs that balance plasticity and stable reuse.