# Workflow-GYM：面向真实世界专业领域长周期GUI智能体任务的基准

- 来源：HuggingFace Daily Papers（社区热门论文）
- 发布时间：2026-06-09 08:00
- AIHOT 分数：65
- AIHOT 链接：https://aihot.virxact.com/items/cmq7h8hbk033ssl5wb8lnkj9c
- 原文链接：https://arxiv.org/abs/2606.11042

## AI 摘要

Workflow-GYM是专门评估AI智能体在专业领域和专用软件环境下执行长周期GUI任务的基准。实验表明，即使是最强模型，成功率也仅略高于30%，凸显出专业长周期GUI工作流对当前智能体的巨大挑战。进一步分析发现，智能体难以维持工作流一致性，频繁出现阶段遗漏、错误传播、目标漂移以及对专业软件环境理解不足等问题。这些发现揭示了当前智能体的局限性，并为下一代GUI智能体研究指明了关键方向。

## 正文

Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still predominantly focus on general-purpose software, relatively simple applications, and short-horizon tasks, leaving it largely unknown whether modern agents can follow user instructions to autonomously operate domain-specific professional software and accomplish economically valuable work in an end-to-end manner. To bridge this gap, we introduce Workflow-GYM, a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Through extensive experiments on state-of-the-art models, we find that even the strongest models achieve only slightly above 30% success rates, highlighting that professional long-horizon GUI workflows remain highly challenging for current GUI agents. Further analysis reveals that current agents struggle to maintain long-horizon workflow consistency, frequently exhibiting workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. Our findings provide important insights into the limitations of current agent systems and suggest key directions for the next generation of GUI-agent research.
