# GTA-2: A Benchmark for General Tool Agents, from Atomic Tool Use to Open-Ended Workflows

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-04-17 08:00
- AIHOT link: https://aihot.virxact.com/items/cmo6ofjdx052tsl4rk1z1ifcg
- Original link: https://arxiv.org/abs/2604.15715

## AI Summary

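The research team has released GTA-2, a benchmark that evaluates general tool agents across the full range from atomic operations to open-ended workflows. It comprises GTA-Atomic (short-horizon, closed-ended tasks) and GTA-Workflow (long-horizon, open-ended tasks), and uses a recursive checkpoint mechanism to decompose objectives and assess end-to-end completion. Experiments show that frontier models score below 50% on atomic tasks and reach only 14.39% success on workflow tasks. The analysis indicates that checkpoint-guided feedback improves performance and that execution frameworks such as Manus and OpenClaw substantially improve workflow completion, underscoring that execution harness design matters beyond the capability of the underlying model.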

## Full Text

The development of general-purpose agents requires a shift from executing simple instructions to completing complex, real-world productivity workflows. However, current tool-use benchmarks remain misaligned with real-world requirements, relying on AI-generated queries, dummy tools, and limited system-level coordination. To address this, we propose GTA-2, a hierarchical benchmark for General Tool Agents (GTA) spanning atomic tool use and open-ended workflows. Built on real-world authenticity, it leverages real user queries, deployed tools, and multimodal contexts. (i) GTA-Atomic, inherited from our prior GTA benchmark, evaluates short-horizon, closed-ended tool-use precision. (ii) GTA-Workflow introduces long-horizon, open-ended tasks for realistic end-to-end completion. To evaluate open-ended deliverables, we propose a recursive checkpoint-based evaluation mechanism that decomposes objectives into verifiable sub-goals, enabling unified evaluation of both model capabilities and agent execution frameworks (i.e., execution harnesses). Experiments reveal a pronounced capability cliff: while frontier models already struggle on atomic tasks (below 50%), they largely fail on workflows, with top models achieving only 14.39% success. Further analysis shows that checkpoint-guided feedback improves performance, while advanced frameworks such as Manus and OpenClaw substantially enhance workflow completion, highlighting the importance of execution harness design beyond the underlying model capacity. These findings provide guidance for developing reliable personal and professional assistants. Dataset and code will be available at https://github.com/open-compass/GTA.
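The abstract describes the recursive checkpoint-based evaluation only at a high level, so the following is a minimal sketch of how such a scheme could look, not the authors' implementation. It assumes a tree of checkpoints in which each leaf is a verifiable sub-goal and each internal node averages its children with equal weight. The `Checkpoint` class, the `score` and `succeeded` functions, the equal weighting, and the example report task are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Checkpoint:
    """A node in a hypothetical checkpoint tree (all names are illustrative)."""
    name: str
    # Leaf nodes carry a check over the deliverable; internal nodes carry children.
    check: Optional[Callable[[dict], bool]] = None
    children: List["Checkpoint"] = field(default_factory=list)


def score(node: Checkpoint, deliverable: dict) -> float:
    """Fraction of satisfied sub-goals under `node` (assumed: unweighted mean)."""
    if not node.children:
        # Leaf: a verifiable sub-goal either passes or fails.
        passed = node.check is not None and node.check(deliverable)
        return 1.0 if passed else 0.0
    # Internal node: recurse into sub-goals and average their scores.
    child_scores = [score(child, deliverable) for child in node.children]
    return sum(child_scores) / len(child_scores)


def succeeded(node: Checkpoint, deliverable: dict) -> bool:
    """Assumed success criterion: every leaf checkpoint passes."""
    return score(node, deliverable) == 1.0


# Hypothetical open-ended task: produce a report with a title and labelled charts.
task = Checkpoint("report", children=[
    Checkpoint("has_title", check=lambda d: bool(d.get("title"))),
    Checkpoint("charts", children=[
        Checkpoint("has_chart", check=lambda d: len(d.get("charts", [])) >= 1),
        Checkpoint(
            "charts_labelled",
            check=lambda d: all(c.get("label") for c in d.get("charts", [])),
        ),
    ]),
])

partial = {"title": "Q1 sales", "charts": [{"label": ""}]}
complete = {"title": "Q1 sales", "charts": [{"label": "Revenue by month"}]}

print(score(task, partial), succeeded(task, partial))
print(score(task, complete), succeeded(task, complete))
```

Under these assumptions, the per-node scores also show which sub-goals failed, which is one plausible way to derive the checkpoint-guided feedback the abstract mentions. The actual scoring and feedback rules may differ in the released code.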
