# SkillVerse Multimodal Skill Paradigm and VisSkillBot: AI Agent Skills Should Go Beyond Plain Text

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-31 08:00
- AIHOT score: 65
- AIHOT link: https://aihot.virxact.com/items/cmpwoqq8804t6slsnomxqntta
- Original link: https://arxiv.org/abs/2606.01414

## AI Summary

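Existing reusable skills for AI agents are mostly stored as plain text, which creates a bottleneck in visual-centric tasks. The study proposes SkillVerse, a multimodal skill paradigm that combines declarative textual logic with explicit visual support and covers three reusable forms: static priors, dynamic priors, and interleaved visual skills. The companion system VisSkillBot automatically converts agent experience into reusable multimodal skills. Experiments show that visual skills consistently outperform text-only skills on GUI and other tasks that require spatial correspondence, visual evidence, and state-aware interaction.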

## Main Text

Reusable skills are a key mechanism for extending agent capabilities, allowing agents to accumulate experience and solve increasingly complex tasks. Yet most existing skill-learning methods store reusable experience as text-only assets, such as instructions, reasoning traces, or summarized trajectories. We argue that this text-only paradigm creates a fundamental bottleneck for visual-centric tasks, where reusable knowledge often depends on spatial layout, visual grounding, fine-grained appearance, and localized state changes. To address this limitation, we propose SkillVerse, a multimodal skill paradigm that combines declarative textual logic with explicit visual support. We distinguish three reusable forms: static priors for stable spatial conventions, dynamic priors for in-situ visual working memory, and interleaved visual skills that bind ordered text steps to the source frames, screenshots, or page regions that justify them. Rather than only describing what to do, visual skills also encode where to look, how to inspect, and how to verify visual outcomes. To scale visual-skill construction, we introduce VisSkillBot, an automatic system that converts agent experience into reusable multimodal skills by preserving textual reasoning, spatial references, visual boundaries, and interaction patterns from task trajectories. Experiments on GUI and other visual-centric tasks show that visual skills consistently outperform text-only skills, particularly when success requires spatial correspondence, visual evidence, and state-aware interaction. These results support our central position: reusable agent skills should go beyond text and become multimodal assets for future multimodal agents.
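
To make the idea of an interleaved visual skill more concrete, here is a minimal Python sketch of how ordered text steps might be bound to the screenshots or page regions that justify them. It is not the paper's data format: every name in it (`SkillForm`, `VisualRef`, `SkillStep`, `MultimodalSkill`, `render`, the file name `step1.png`) is a hypothetical choice made for illustration, and the abstract does not specify how SkillVerse or VisSkillBot actually represent skills.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple

# All names below are hypothetical; the abstract does not define a schema.
BBox = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


class SkillForm(Enum):
    """The three reusable forms named in the abstract."""
    STATIC_PRIOR = "static_prior"
    DYNAMIC_PRIOR = "dynamic_prior"
    INTERLEAVED = "interleaved_visual_skill"


@dataclass
class VisualRef:
    """Pointer to the visual evidence that justifies a step."""
    image_path: str                 # source frame, screenshot, or page image
    region: Optional[BBox] = None   # None means the whole image


@dataclass
class SkillStep:
    """One ordered text step plus its optional visual support."""
    instruction: str                      # what to do
    look_at: Optional[VisualRef] = None   # where to look
    verify: Optional[str] = None          # how to check the visual outcome


@dataclass
class MultimodalSkill:
    name: str
    form: SkillForm
    steps: List[SkillStep] = field(default_factory=list)

    def render(self) -> List[str]:
        """Flatten the skill into an interleaved text/image outline."""
        lines: List[str] = []
        for i, step in enumerate(self.steps, start=1):
            lines.append(f"{i}. {step.instruction}")
            if step.look_at is not None:
                r = step.look_at.region
                where = "full image" if r is None else f"region {r}"
                lines.append(f"   [image: {step.look_at.image_path}, {where}]")
            if step.verify:
                lines.append(f"   verify: {step.verify}")
        return lines


# Illustrative skill with made-up steps and a made-up screenshot name.
skill = MultimodalSkill(
    name="export_pdf",
    form=SkillForm.INTERLEAVED,
    steps=[
        SkillStep(
            instruction="Open the File menu",
            look_at=VisualRef("step1.png", (0, 0, 120, 30)),
            verify="menu is expanded",
        ),
        SkillStep(instruction="Click Export"),
    ],
)
rendered = skill.render()
```

A text-only skill would correspond to steps with `look_at` and `verify` left empty. In this sketch, the extra fields are what carry the "where to look" and "how to verify" information that the abstract argues plain text loses.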