SEAL: Co-Evolution of Agents and Learning Environments
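Source: arxiv.org
To address the misalignment that arises when the policy and the environment are optimized separately during the self-evolution of large language model agents, this paper proposes the SEAL framework. It builds a closed-loop co-evolution system: it collects policy trajectories, diagnoses failures, and uses the failure diagnoses as a shared signal to optimize both the agent's model policy and the training environment. The environment side evolves its learning interface to provide clearer tool affordance cues, while the policy side uses the diagnostic information to update the model. Experiments show that with only 400 training samples, SEAL improves average scores by 8.25 to 26.25 points across three backbones and exhibits cross-domain transfer.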
Large Language Model (LLM) agents are increasingly improved through interaction, yet most self-evolution methods adapt either the policy or the learning environment in isolation. We identify this structural gap as Agent-Environment Misalignment: the agent's capability frontier changes during training, while the environment that provides supervision remains static or only weakly coupled to the agent's revealed failures. We propose SEAL, a closed-loop co-evolution framework for interactive tool-use agents. SEAL collects on-policy trajectories under executable verification, diagnoses failed rollouts into turn-level failure labels, and uses these diagnoses as a shared signal for both environment-side adaptation and model-side policy optimization. The environment evolves its training-time learning interface by exposing clearer tool affordance cues, constraint information, and recovery-oriented feedback, while the policy is updated with diagnosis-guided advantage reweighting. Extensive experiments across in-distribution and out-of-distribution multi-turn tool-use evaluations show that SEAL improves low-resource agent learning: with only 400 training samples, it yields +8.25 to +26.25 average-point gains across three backbones and exhibits positive out-of-distribution transfer. These results demonstrate the value of jointly adapting the learner and its training-time learning substrate for robust self-improving LLM agents.
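The abstract does not give the form of the diagnosis-guided advantage reweighting, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each turn already carries an advantage estimate and a diagnosed failure label, and that reweighting is a per-label multiplicative scale. The label names, the weight values, and the helpers `reweight_advantages` and `failure_counts` are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical failure taxonomy and weights (assumptions for illustration;
# the abstract does not list the actual labels or weighting scheme).
FAILURE_WEIGHTS = {
    "ok": 1.0,
    "wrong_tool": 2.0,
    "constraint_violation": 1.5,
}


@dataclass
class Turn:
    advantage: float   # advantage estimate for this turn, from any estimator
    label: str = "ok"  # turn-level diagnosis of the rollout


def reweight_advantages(turns, weights=FAILURE_WEIGHTS):
    """Policy side: scale each turn's advantage by the weight of its label.

    Unknown labels fall back to a weight of 1.0 (no reweighting).
    """
    return [t.advantage * weights.get(t.label, 1.0) for t in turns]


def failure_counts(turns):
    """Environment side: count diagnosed failures per label.

    The same diagnoses could tell the environment which cues or feedback
    to expose more clearly in the next round.
    """
    counts = {}
    for t in turns:
        if t.label != "ok":
            counts[t.label] = counts.get(t.label, 0) + 1
    return counts


# Usage: one rollout with two diagnosed failures.
rollout = [
    Turn(0.5),
    Turn(-1.0, "wrong_tool"),
    Turn(-0.5, "constraint_violation"),
]
scaled = reweight_advantages(rollout)
summary = failure_counts(rollout)
```

The point of the sketch is that a single set of turn-level diagnoses feeds both sides of the loop: `scaled` would enter the policy update, while `summary` would inform how the training-time interface is adapted.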