Artificial Analysis @ArtificialAnlys

2026-04-08 23:09

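AI summary

Artificial Analysis has released the APEX-Agents-AA leaderboard, which uses Mercor's APEX-Agents benchmark to evaluate AI agents on long-horizon professional tasks (investment banking, management consulting, corporate law). The evaluation runs 452 tasks through the Stirrup framework with MCP tools, covering message replies, document processing, and more. GPT-5.4 leads at 33.3%, closely followed by Claude Opus 4.6 (33.0%) and Gemini 3.1 Pro Preview (32%), so the top three are tightly matched. Scoring uses LLM judging and a pass@1 criterion.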
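To make the pass@1 criterion concrete, here is a minimal Python sketch of how such a score could be computed from judge verdicts. The `TaskResult` shape and the rule that a task passes only when every rubric criterion is met are assumptions for illustration, not details confirmed by the post.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    # Hypothetical record: one LLM-judge verdict per rubric criterion.
    task_id: str
    criteria_met: list[bool]


def task_passed(result: TaskResult) -> bool:
    # Assumption: a task passes only if it has criteria and all of them are met.
    return bool(result.criteria_met) and all(result.criteria_met)


def pass_at_1(results: list[TaskResult]) -> float:
    # With a single attempt per task, pass@1 is the share of tasks that passed.
    if not results:
        raise ValueError("no results to score")
    return sum(task_passed(r) for r in results) / len(results)
```

Under these assumptions, a headline score such as 33.3% would mean roughly one task in three was fully completed on the first attempt.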

Announcing APEX-Agents-AA, our latest leaderboard on Artificial Analysis, evaluating AI agents on long-horizon professional services tasks with realistic application dependencies.

This is our implementation of the APEX-Agents benchmark - an agentic work task evaluation open-sourced by @mercor_ai. It tests an AI agent's ability to execute realistic tasks created by investment banking analysts, management consultants, and corporate lawyers. Mercor released extensive data to enable model evaluation and training across the community, comprising 480 tasks along with tool implementations, rubrics, and grading workflows.

We exclude tasks with external service dependencies and run the remaining 452 tasks for APEX-Agents-AA. Models complete tasks using Stirrup, our open-source agent harness as used in GDPval-AA, and a customized tool set based on the original benchmark implementation.
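The filtering step can be pictured with a short Python sketch. The `external_services` field and the dictionary task format are hypothetical; the post does not describe how the dependency is recorded in the released data.

```python
def select_runnable(tasks: list[dict]) -> list[dict]:
    # Keep only tasks that declare no external service dependency.
    # "external_services" is a hypothetical field name used for illustration.
    return [t for t in tasks if not t.get("external_services")]


# The two figures in the post imply how many tasks were dropped.
TOTAL_TASKS = 480
RUN_TASKS = 452
EXCLUDED_TASKS = TOTAL_TASKS - RUN_TASKS  # 28
```

Going from 480 released tasks to 452 run tasks means 28 tasks, about 5.8% of the set, were excluded.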

Results overview:

🏅 OpenAI, Anthropic and Google are in close competition at the top of the leaderboard, with 33.3% for GPT-5.4, 33.0% for Claude Opus 4.6, and 32% for Gemini 3.1 Pro Preview

📈 The overall scores on Artificial Analysis today are similar to Mercor's testing, but some models, such as GPT-5.4 nano, score higher with our Stirrup test harness

↻ We'll keep updating this leaderboard as key models for agentic work are released, using it as a metric for agent capability on well-defined, long-horizon work tasks

APEX-Agents overview:

➤ Tasks span 3 professional domains: investment banking, management consulting, and corporate law

➤ The tasks are designed to require long-horizon work with a large number of tools, which are provided through MCP servers as would be used in many real-world deployments (including calendar, chat, spreadsheet and presentation operations, etc.)

➤ Required outputs include direct message responses (87%) and creating or modifying spreadsheets (6.6%), documents (4.8%), and presentations (1.3%)
