SkillFlow: A Benchmark for Lifelong Skill Discovery and Evolution in Autonomous Agents

2026-04-19 08:00

AI Summary

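The SkillFlow benchmark comprises 166 tasks across 20 task families and uses an Agentic Lifelong Learning protocol to evaluate whether autonomous agents can discover skills from scratch, patch them, and maintain a skill library. In the experiments, lifelong skill evolution raises Claude Opus 4.6's task success rate from 62.65% to 71.08%, whereas Kimi K2.5 gains only 0.60 percentage points despite a 66.87% skill usage rate, and Qwen-Coder-Next reaches a completion rate of only 44.58% and regresses relative to its baseline. The results reveal a marked gap between how often skills are used and how useful they are.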
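As a sanity check on the figures above, the reported percentages can be mapped back to whole-task counts out of 166. The sketch below is an inference from arithmetic only: the source reports percentages, not counts, and the helper name `implied_count` is hypothetical.

```python
TOTAL_TASKS = 166  # benchmark size stated in the summary


def implied_count(percent: float, total: int = TOTAL_TASKS):
    """Return the whole number of tasks whose share of `total`
    rounds to `percent` (two decimals), or None if there is none."""
    n = round(percent * total / 100)
    return n if round(100 * n / total, 2) == percent else None


# Percentages quoted in the summary; the counts are inferred, not reported.
for label, percent in [
    ("Claude Opus 4.6, vanilla", 62.65),
    ("Claude Opus 4.6, lifelong", 71.08),
    ("Claude Opus 4.6, gain", 8.43),
    ("Kimi K2.5, skill usage", 66.87),
    ("Kimi K2.5, gain", 0.60),
    ("Qwen-Coder-Next, completion", 44.58),
]:
    print(f"{label}: {percent}% ~ {implied_count(percent)} of {TOTAL_TASKS}")
```

Under this reading, the +8.43-point gain corresponds to 14 additional tasks solved, while a +0.60-point gain corresponds to a single task.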

Original abstract · untranslated

As the capability frontier of autonomous agents continues to expand, they are increasingly able to complete specialized tasks through plug-and-play external skills. Yet current benchmarks mostly test whether models can use provided skills, leaving open whether they can discover skills from experience, repair them after failure, and maintain a coherent library over time. We introduce SkillFlow, a benchmark of 166 tasks across 20 families in which task construction within each family follows a Domain-Agnostic Execution Flow (DAEF) that defines an agent workflow framework, allowing these tasks to share a consistent workflow. Agents are evaluated under an Agentic Lifelong Learning protocol in which they begin without skills, solve tasks sequentially within each family, externalize lessons through trajectory- and rubric-driven skill patches, and carry the updated library forward. Experiments reveal a substantial capability gap. For Claude Opus 4.6, lifelong skill evolution improves task success from 62.65% to 71.08% (+8.43 points). However, high skill usage does not necessarily imply high utility: Kimi K2.5 gains only +0.60 points despite 66.87% skill usage, while Qwen-Coder-Next reaches only a 44.58% task completion rate and still regresses relative to the vanilla setting. SkillFlow contributes a structured testbed for this direction and an in-depth empirical analysis of skill discovery, patching, transfer, and their failure modes under lifelong evaluation.
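The evaluation loop described in the abstract (start with no skills, solve a family's tasks in order, patch the library after each task, carry it forward) can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: `SkillLibrary`, `run_family`, and the toy `solve` and `propose_patch` callables are all hypothetical names, and the success flag stands in for the rubric feedback mentioned in the abstract.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class SkillLibrary:
    """Hypothetical container mapping a skill name to its text."""
    skills: Dict[str, str] = field(default_factory=dict)

    def apply_patch(self, patch: Dict[str, str]) -> None:
        # A patch adds new skills or overwrites (repairs) existing ones.
        self.skills.update(patch)


# solve(task, library) -> (success, trajectory)
SolveFn = Callable[[str, SkillLibrary], Tuple[bool, str]]
# propose_patch(task, trajectory, success, library) -> patch
PatchFn = Callable[[str, str, bool, SkillLibrary], Dict[str, str]]


def run_family(tasks: List[str], solve: SolveFn,
               propose_patch: PatchFn) -> Tuple[List[bool], SkillLibrary]:
    """Run one task family sequentially with a library that persists."""
    library = SkillLibrary()  # the agent begins without skills
    outcomes: List[bool] = []
    for task in tasks:
        success, trajectory = solve(task, library)
        outcomes.append(success)
        # Externalize the lesson from this attempt, then carry it forward.
        library.apply_patch(propose_patch(task, trajectory, success, library))
    return outcomes, library


# Toy stand-ins: a task succeeds only once a parsing skill exists.
def toy_solve(task: str, library: SkillLibrary) -> Tuple[bool, str]:
    ok = "parse_input" in library.skills
    return ok, f"{task}: {'ok' if ok else 'failed, no parser'}"


def toy_patch(task: str, trajectory: str, success: bool,
              library: SkillLibrary) -> Dict[str, str]:
    return {} if success else {"parse_input": f"lesson from {task}"}


outcomes, library = run_family(["t1", "t2", "t3"], toy_solve, toy_patch)
print(outcomes, sorted(library.skills))
```

In this toy run the first task fails, the resulting patch adds a skill, and the later tasks in the same family benefit from it, which is the kind of transfer the protocol is meant to measure.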

HuggingFace Daily Papers (trending community papers)

Read the original · arxiv.org
