SciAgentArena: A Benchmark for Evaluating AI Agents on Cross-Scale Scientific Challenges
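Source: arxiv.org
SciAgentArena was introduced to fill a gap in how AI agents are evaluated in real research settings. It comprises approximately 200 cross-domain scientific tasks and supports stepwise verification and interactive evaluation. Testing shows that current AI agents contribute effectively to specific data-analysis workflows where the task structure and evaluation criteria are clear, but their performance remains uneven when generating novel insights, sustaining autonomous exploration, and formulating robust solutions to open-ended research questions. The benchmark provides a practical framework for measuring the progress of AI agents in science, and the associated code, tasks, and datasets have been open-sourced.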
AI agents are increasingly being developed to accelerate scientific discovery, yet their practical capabilities in real research settings remain poorly understood. Existing benchmarks for AI agents rarely capture the complexity, heterogeneity, and extended reasoning required by scientific work, whereas benchmarks for scientific tasks often reduce research to static, direct problems and provide limited support for interactive evaluation. Here, we introduce SciAgentArena, a systematic benchmark for evaluating AI agents in real-world scientific research scenarios drawn from emerging needs across multiple domains. SciAgentArena comprises approximately 200 tasks with stepwise verification and an interactive, agent-agnostic environment for assessing diverse AI agents. Using this benchmark, we find that current agents can contribute effectively to well-specified data-analysis workflows, particularly when the task structure and evaluation criteria are clear. However, their performance remains uneven across scientific contexts: agents struggle to generate genuinely novel insights, sustain self-directed exploration, and formulate robust solutions for open-ended research questions. We further characterize common failure modes across agents and identify opportunities for improving their reliability, autonomy, and scientific reasoning. Overall, SciAgentArena provides a practical framework for measuring progress in AI agents for science and for guiding the design of future agents capable of addressing complex scientific challenges. The full code, tasks, and datasets are available at https://sciagentarena.github.io/.