SkillEvolBench:评估从情景经验到程序性技能的演进
阅读原文· arxiv.org该研究提出了SkillEvolBench,一个用于评估大语言模型智能体能否将情景经验提炼为可复用程序性技能的诊断基准。基准包含180个任务,分布在六个真实智能体环境中。测试发现,当前智能体通常只能局部适应,很少能形成稳健的可复用技能。基于技能的条件有时能改善获取或重放,但在冻结部署任务下表现不稳定。原始轨迹重用经常优于蒸馏的技能,表明当前的抽象过程丢弃了对未来任务仍有用的上下文和程序性线索。研究基于十个模型配置和三个智能体工具包,指出仅写入更多技能或更大的资源库并不足够。
Large language model (LLM) agents accumulate rich episodic trajectories while solving real-world tasks, but it remains unclear whether such experience can be distilled into reusable procedural skills. We introduce SkillEvolBench, a diagnostic benchmark for evaluating this step from experience reuse to skill formation. It contains 180 tasks across six real-world agent environments, organized into role-conditioned task families with shared latent procedures. Agents learn from acquisition tasks, update an external skill library using compacted trajectories and verifier feedback, and then face frozen deployment tasks testing context shift, adversarial shortcuts, and composition. By comparing self-generated and curated-start skill evolution against no-skill and raw-trajectory controls, SkillEvolBench separates procedural abstraction from base capability, curated prior knowledge, and direct reuse of episodic traces. Across ten model configurations and three agent harnesses, we find that current agents often adapt locally but rarely form robust reusable skills. Skill-based conditions can improve acquisition or replay, and individual models sometimes gain on specific deployment axes, but these gains are unstable under frozen deployment. Raw-trajectory reuse frequently outperforms distilled skills, suggesting that current abstraction procedures discard contextual and procedural cues that remain useful for future tasks. Capacity and cost analyses further show that writing more skills or larger Tier-3 resource libraries is not sufficient: additional updates can improve coverage while introducing episode-specific drift and procedural clutter. These findings position SkillEvolBench as a testbed for measuring when one-off experience becomes durable procedural knowledge rather than task-local memory.