# OPD-Evolver: Cultivating a Versatile Agent Evolver via On-Policy Self-Distillation

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-16 08:00
- AIHOT score: 43
- AIHOT link: https://aihot.virxact.com/items/cmqhgiko20471sle1uw063fhv
- Original link: https://arxiv.org/abs/2606.17628

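## AI Summary

OPD-Evolver is a slow-fast co-evolution framework that cultivates an agent evolver through on-policy self-distillation. In the fast loop, the agent interacts with a four-level memory hierarchy to read, use, write, and maintain experience, enabling rapid test-time evolution. In the slow loop, outcome-calibrated memory attribution and privileged hindsight distill these four abilities into a deployable policy. On multi-domain benchmarks, OPD-Evolver outperforms ReasoningBank by up to 11.5% and Skill0 by about 5.8%. Analysis indicates that it internalizes high-value experience and memory management, allowing the 9B-parameter version to challenge hundred-billion-parameter-scale models such as Qwen3.5-397B-A17B and Step-3.5-Flash.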

## Body

Memory has become a standard substrate for self-evolving agents, yet retaining experience is not the same as learning how to evolve through it. Existing memory agents can store trajectories, retrieve reflections, or accumulate skills, but often lack the holistic competence to select useful experience, act on it, write reusable knowledge, and maintain a growing repository. We introduce OPD-Evolver, a slow-fast co-evolution framework that cultivates such an agent evolver through on-policy self-distillation. In the fast loop, OPD-Evolver interacts with a four-level memory hierarchy to read, use, write, and maintain experience for rapid test-time evolution. In the slow loop, outcome-calibrated memory attribution and privileged hindsight distill these four abilities into the deployable policy. Across multi-domain benchmarks, OPD-Evolver surpasses memory systems such as ReasoningBank by up to 11.5%, and training-based methods such as Skill0 by ~5.8%. Further analysis shows that OPD-Evolver internalizes high-value experience and memory management, enabling OPD-Evolver-9B to challenge giant counterparts such as Qwen3.5-397B-A17B and Step-3.5-Flash, pointing beyond memory-augmented agents toward genuinely qualified agent evolvers.
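
To make the four fast-loop abilities named in the abstract (read, use, write, maintain) concrete, here is a minimal toy sketch in Python. It is not the paper's implementation. The abstract does not specify the four memory levels or the attribution rule, so this sketch assumes a single flat store, word-overlap retrieval, and a simple win-rate counter as a crude stand-in for outcome-calibrated memory attribution. All names (`MemoryItem`, `ToyMemory`, `fast_step`, `solve`) are hypothetical, and the slow distillation loop is omitted entirely.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryItem:
    text: str
    uses: int = 0  # how many times this item was retrieved
    wins: int = 0  # how many of those retrievals preceded a success


@dataclass
class ToyMemory:
    """Hypothetical flat store; the paper describes a four-level hierarchy."""
    capacity: int = 3
    items: list = field(default_factory=list)

    def read(self, query, k=2):
        # Read: rank items by word overlap with the query (assumed heuristic).
        q = set(query.lower().split())
        scored = [(len(q & set(m.text.lower().split())), m) for m in self.items]
        ranked = sorted(scored, key=lambda pair: -pair[0])
        return [m for score, m in ranked if score > 0][:k]

    def write(self, text):
        # Write: store a new lesson unless an identical one already exists.
        if all(m.text != text for m in self.items):
            self.items.append(MemoryItem(text))

    def credit(self, used, success):
        # Crude outcome-based credit: every retrieved item shares the outcome.
        for m in used:
            m.uses += 1
            m.wins += int(success)

    def maintain(self):
        # Maintain: keep the highest win-rate items; unused items score 0.5.
        self.items.sort(
            key=lambda m: m.wins / m.uses if m.uses else 0.5, reverse=True
        )
        del self.items[self.capacity:]


def fast_step(memory, task, solve):
    """One toy fast-loop step: read, use, write, maintain."""
    used = memory.read(task)             # read
    success, lesson = solve(task, used)  # use (solve is a caller-supplied stub)
    memory.credit(used, success)
    if lesson:
        memory.write(lesson)             # write
    memory.maintain()                    # maintain
    return success
```

Here `solve` stands for whatever policy attempts the task with the retrieved items; it is assumed to return a `(success, lesson)` pair. In the framework described above, the slow loop would then distill these behaviors into the policy itself, which this sketch does not attempt.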