GauntletBench: Re-evaluating AI Agent Capabilities in Unfamiliar Environments

2026-06-25 08:00 · 8 days ago

AI Summary

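GauntletBench is a web-based benchmark for evaluating how well AI agents generalise to unfamiliar scenarios. It focuses on three underexplored capabilities (temporal perception, graphical understanding, and 3D reasoning) and covers five professional applications (Video Editor, Workflow Builder, 3D Modeller, Flight Analyser, and Circuit Designer), each with 20 vision-intensive tasks (100 in total). The results show that the state-of-the-art agent reaches a success rate of only 19.1%, whereas non-expert humans exceed 80%, which highlights the substantial gap between current agents and complex real-world scenarios.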

Original text · untranslated

As agentic systems continue to evolve and are widely deployed in real-world scenarios, there is a growing demand to evaluate their capabilities faithfully. However, current benchmarks are typically built on popular applications with relatively simple tasks and focus on a narrow set of capabilities while overlooking broader dimensions. As a result, modern agents saturate these benchmarks, which fail to probe the agents' limitations. To this end, we introduce GauntletBench, a web-based benchmark for evaluating agent generalisation in challenging scenarios. It focuses on three underexplored capabilities (temporal perception, graphical understanding, and 3D reasoning) across five less-covered professional applications (Video Editor, Workflow Builder, 3D Modeller, Flight Analyser, and Circuit Designer), each with 20 vision-intensive tasks (100 in total). Our benchmark provides a modular pipeline that comprises an environment compatible with both open- and closed-source agent frameworks, a controlled web-based application, a well-structured task suite, and an automated evaluation engine with diverse metrics. Contrary to widespread expectations, our empirical results reveal that frontier agentic systems remain far from achieving human-level performance. Even the state-of-the-art agent achieves only a 19.1% success rate on GauntletBench, highlighting its limitations in these overlooked capabilities and in generalisation. By comparison, non-expert human annotators achieve over 80% success on our challenging yet feasible tasks, revealing the substantial gap between current agent capabilities and those required for complex real-world scenarios.

HuggingFace Daily Papers (community trending papers)

Read the original · arxiv.org
