SkillFlow: A Benchmark for Lifelong Skill Discovery and Evolution in Autonomous Agents
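Source: arxiv.org
The SkillFlow benchmark comprises 166 tasks across 20 task families and uses an Agentic Lifelong Learning protocol to evaluate whether autonomous agents can discover, patch, and maintain a skill library from scratch. In experiments, Claude Opus 4.6 raises its task success rate from 62.65% to 71.08% through lifelong skill evolution, whereas Kimi K2.5 improves by only 0.60 percentage points despite a skill usage rate of 66.87%, and Qwen-Coder-Next reaches a completion rate of only 44.58% and regresses relative to its baseline. The results reveal a marked gap between how often skills are used and how useful they are.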
As the capability frontier of autonomous agents continues to expand, they are increasingly able to complete specialized tasks through plug-and-play external skills. Yet current benchmarks mostly test whether models can use provided skills, leaving open whether they can discover skills from experience, repair them after failure, and maintain a coherent library over time. We introduce SkillFlow, a benchmark of 166 tasks across 20 families. Within each family, tasks are constructed from a Domain-Agnostic Execution Flow (DAEF), an agent workflow framework that gives the family's tasks a consistent workflow. Agents are evaluated under an Agentic Lifelong Learning protocol in which they begin without skills, solve tasks sequentially within each family, externalize lessons through trajectory- and rubric-driven skill patches, and carry the updated library forward. Experiments reveal a substantial capability gap. For Claude Opus 4.6, lifelong skill evolution improves task success from 62.65% to 71.08% (+8.43 points). However, high skill usage does not necessarily imply high utility: Kimi K2.5 gains only +0.60 points despite 66.87% skill usage, while Qwen-Coder-Next reaches only a 44.58% task completion rate and still regresses relative to the vanilla setting. SkillFlow contributes a structured testbed for this direction and an in-depth empirical analysis of skill discovery, patching, transfer, and their failure modes under lifelong evaluation.
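The Agentic Lifelong Learning protocol described above can be pictured as a simple loop over one task family. The following is a minimal sketch under stated assumptions, not the benchmark's actual harness: `SkillLibrary`, `run_family`, `success_rate`, and the toy `toy_solve` and `toy_patch` functions are all hypothetical names introduced here for illustration, and the abstract does not specify how skills are stored or how patches are generated.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class SkillLibrary:
    """Hypothetical skill store mapping a skill name to its text."""
    skills: Dict[str, str] = field(default_factory=dict)

    def apply_patch(self, patch: Dict[str, str]) -> None:
        # A patch may add new skills or overwrite (repair) existing ones.
        self.skills.update(patch)


# Hypothetical callables standing in for the agent and the patching step.
Solver = Callable[[str, SkillLibrary], Tuple[bool, str]]
Patcher = Callable[[str, str, bool, SkillLibrary], Dict[str, str]]


def run_family(tasks: List[str], solve: Solver, propose_patch: Patcher) -> List[bool]:
    """Solve one family's tasks in order, carrying the library forward."""
    library = SkillLibrary()  # the agent begins without skills
    outcomes: List[bool] = []
    for task in tasks:
        success, trajectory = solve(task, library)
        outcomes.append(success)
        # Externalize lessons from this attempt before the next task.
        library.apply_patch(propose_patch(task, trajectory, success, library))
    return outcomes


def success_rate(outcomes_by_family: Dict[str, List[bool]]) -> float:
    """Percentage of solved tasks, pooled over all families."""
    flat = [ok for outcomes in outcomes_by_family.values() for ok in outcomes]
    return 100.0 * sum(flat) / len(flat)


# Toy stand-ins (assumptions): the task succeeds only once a "parse" skill exists,
# and a patch is written only after a failure.
def toy_solve(task: str, library: SkillLibrary) -> Tuple[bool, str]:
    return "parse" in library.skills, f"attempted {task}"


def toy_patch(task: str, trajectory: str, success: bool,
              library: SkillLibrary) -> Dict[str, str]:
    return {} if success else {"parse": f"lesson learned from {task}"}


outcomes = run_family(["t1", "t2", "t3"], toy_solve, toy_patch)
rate = success_rate({"toy_family": outcomes})
```

In this toy run the first task fails, the failure produces a patch, and the later tasks benefit from it. The library is reset per family here because the abstract says tasks are solved sequentially within each family; whether skills also transfer across families is not something this sketch assumes.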