# SkillFlow: A Benchmark for Lifelong Skill Discovery and Evolution in Autonomous Agents

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-04-19 08:00
- AIHOT link: https://aihot.virxact.com/items/cmo86vj4q044zslmlzz9xnnyv
- Original link: https://arxiv.org/abs/2604.17308

## AI Summary

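The SkillFlow benchmark comprises 166 tasks across 20 task families and uses an Agentic Lifelong Learning protocol to evaluate whether autonomous agents can discover, patch, and maintain a skill library from scratch. In the experiments, Claude Opus 4.6 raises its task success rate from 62.65% to 71.08% through lifelong skill evolution. Kimi K2.5 improves by only 0.60 percentage points despite a skill usage rate of 66.87%, and Qwen-Coder-Next reaches a completion rate of only 44.58% while regressing relative to its baseline. Together these results reveal a marked gap between how often skills are used and how useful they are.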
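As a quick sanity check on the figures above, the reported percentages are all consistent with whole-task counts out of 166. The counts in the sketch below (104, 118, 111, 74, and a 1-task gain) are back-computed assumptions and do not appear in the source; the helper `pct` is likewise hypothetical.

```python
TOTAL_TASKS = 166  # benchmark size stated in the summary


def pct(count: int, total: int = TOTAL_TASKS) -> float:
    """Return count/total as a percentage rounded to two decimals."""
    return round(100 * count / total, 2)


# Assumed task counts, inferred so that they reproduce the reported percentages.
opus_baseline = pct(104)   # reported: 62.65
opus_evolved = pct(118)    # reported: 71.08
opus_gain = round(opus_evolved - opus_baseline, 2)  # reported: +8.43 points

kimi_skill_usage = pct(111)  # reported: 66.87
kimi_gain = pct(1)           # reported: +0.60 points, i.e. one extra task
qwen_completion = pct(74)    # reported: 44.58

print(opus_baseline, opus_evolved, opus_gain)
print(kimi_skill_usage, kimi_gain, qwen_completion)
```

Under these assumed counts, the Claude Opus 4.6 gain corresponds to 14 additional solved tasks, while the Kimi K2.5 gain corresponds to a single task.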

## Main Text

As the capability frontier of autonomous agents continues to expand, they are increasingly able to complete specialized tasks through plug-and-play external skills. Yet current benchmarks mostly test whether models can use provided skills, leaving open whether they can discover skills from experience, repair them after failure, and maintain a coherent library over time. We introduce SkillFlow, a benchmark of 166 tasks across 20 families. Within each family, tasks are constructed according to a Domain-Agnostic Execution Flow (DAEF), an agent workflow framework, so that all tasks in the family share a consistent workflow. Agents are evaluated under an Agentic Lifelong Learning protocol in which they begin without skills, solve tasks sequentially within each family, externalize lessons through trajectory- and rubric-driven skill patches, and carry the updated library forward. Experiments reveal a substantial capability gap. For Claude Opus 4.6, lifelong skill evolution improves task success from 62.65% to 71.08% (+8.43 points). However, high skill usage does not necessarily imply high utility: Kimi K2.5 gains only +0.60 points despite 66.87% skill usage, while Qwen-Coder-Next reaches only a 44.58% task completion rate and still regresses relative to the vanilla setting. SkillFlow contributes a structured testbed for this direction and an in-depth empirical analysis of skill discovery, patching, transfer, and their failure modes under lifelong evaluation.
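The Agentic Lifelong Learning protocol described above can be sketched as a simple loop over one task family. This is a minimal illustration under stated assumptions, not the benchmark's actual harness: `SkillLibrary`, `run_family`, `solve`, and `propose_patch` are hypothetical names, skills are modeled as plain name-to-text entries, and the success flag stands in for rubric feedback. Whether the library is reset between families is not stated in the abstract; the sketch simply starts each family from an empty library.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class SkillLibrary:
    """Hypothetical skill store: maps a skill name to its text."""
    skills: Dict[str, str] = field(default_factory=dict)

    def apply_patch(self, patch: Dict[str, str]) -> None:
        # A patch adds new skills or overwrites (repairs) existing ones.
        self.skills.update(patch)


# Assumed interfaces: solve returns (success, trajectory);
# propose_patch turns a trajectory plus feedback into skill edits.
Solve = Callable[[str, SkillLibrary], Tuple[bool, str]]
ProposePatch = Callable[[str, str, bool], Dict[str, str]]


def run_family(tasks: List[str], solve: Solve,
               propose_patch: ProposePatch) -> Tuple[List[bool], SkillLibrary]:
    library = SkillLibrary()  # the agent begins without skills
    outcomes: List[bool] = []
    for task in tasks:  # tasks are solved sequentially
        success, trajectory = solve(task, library)
        outcomes.append(success)
        # Externalize lessons and carry the updated library forward.
        library.apply_patch(propose_patch(task, trajectory, success))
    return outcomes, library


# Toy stand-ins: a task succeeds only if a skill with its name already exists.
def toy_solve(task: str, library: SkillLibrary) -> Tuple[bool, str]:
    return task in library.skills, f"attempted {task}"


def toy_patch(task: str, trajectory: str, success: bool) -> Dict[str, str]:
    return {} if success else {task: f"lesson from: {trajectory}"}


outcomes, library = run_family(["parse", "parse", "plot", "parse"],
                               toy_solve, toy_patch)
print(outcomes)                 # [False, True, False, True]
print(sorted(library.skills))   # ['parse', 'plot']
```

In the toy run, the first attempt at each task type fails and produces a patch, and later tasks of the same type succeed because the library is carried forward. A real agent could also emit harmful patches, which is one way a model might regress relative to the vanilla setting.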
