# Artificial Analysis releases AA-Briefcase, an agentic knowledge work benchmark

- Source: Artificial Analysis (@ArtificialAnlys)
- Published: 2026-06-25 06:43
- AIHOT score: 61
- AIHOT link: https://aihot.virxact.com/items/cmqsnywwm04rhslfu3tkmnzba
- Original link: https://x.com/ArtificialAnlys/status/2069914443639635978

## AI Summary

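Artificial Analysis has released the AA-Briefcase benchmark, which tests models on producing deliverables such as financial models and board presentations in the context of multi-week projects. Key results: Claude Opus 4.8 averages about 23 minutes per task and is the highest-scoring available model, but also one of the slowest; GPT-5.5 (xhigh) takes only about 11 minutes, making it one of the most efficient top performers, and ranks top 5 on Elo; GLM-5.2 scores 1261 at 16.3 minutes per task, the best result among open-weights models; MiniMax-M3 scores 1113. Claude Fable 5, which is no longer available, would take roughly 28.5 minutes. Tool calls and execution account for only about 12% of total time, with the rest driven by output verbosity, turn usage, and inference speed.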

## Body

Agentic knowledge work can take frontier models over 20 minutes per task, as measured in AA-Briefcase, our new benchmark

Last week we released AA-Briefcase, our proprietary agentic knowledge work benchmark testing models on long-horizon tasks built by industry experts. AA-Briefcase requires models to build deliverables such as financial models, board presentations, and design mock-ups in the context of realistic multi-week projects.

One of the key metrics we measure in AA-Briefcase is average time per task. This is calculated using evaluation token usage， representative model output speeds， and tool execution time recorded during evaluation.

Key time-per-task takeaways from AA-Briefcase:

➤ Claude Opus 4.8 is the highest-scoring available model, but it is also one of the slowest, taking ~23 minutes per task on average

➤ Several GPT-5.5 reasoning variants lie along the Pareto frontier of AA-Briefcase Elo vs. Time per Task, including medium, high, and xhigh. GPT-5.5 (xhigh) in particular stands out as one of the most efficient top-performing models, using around half the time per task of Opus 4.8 (11 minutes) while ranking top 5 on the overall AA-Briefcase Elo

➤ GLM-5.2 also sits on the Pareto frontier, scoring 1261, ahead of GPT-5.5 (xhigh, 1159) but taking more time per task (16.3 minutes). It is also the top-performing open-weights model on AA-Briefcase, with MiniMax-M3 the next best at 1113

➤ If Claude Fable 5 were still available, it would likely take around 28.5 minutes per task: while it was live, we measured ~91 output tokens per second, ~3.1 minutes of tool execution time per task, and ~139,000 output tokens per task

➤ Time spent on tool calls and execution accounts for only ~12% of the total time, with the remaining amount explained by output verbosity, turn usage, and inference speed
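The Claude Fable 5 figures above are enough to reproduce the time estimate approximately. The sketch below assumes the simplest reading of the method: generation time is output tokens divided by output speed, and tool execution time is added on top. The function name and this exact formula are illustrative assumptions, not Artificial Analysis's published calculation, and the sketch ignores anything else that may feed the real metric, such as per-turn latency or input processing.

```python
def estimate_minutes_per_task(output_tokens, tokens_per_second, tool_minutes):
    """Assumed model: generation time (tokens / speed) plus tool execution time."""
    generation_minutes = output_tokens / tokens_per_second / 60
    return generation_minutes + tool_minutes


# Approximate Claude Fable 5 measurements quoted in the post.
fable_total = estimate_minutes_per_task(
    output_tokens=139_000,
    tokens_per_second=91,
    tool_minutes=3.1,
)

# Share of the estimated time spent in tool execution for this one model.
fable_tool_share = 3.1 / fable_total

print(f"Estimated time per task: {fable_total:.1f} minutes")
print(f"Tool execution share: {fable_tool_share:.0%}")
```

With these rounded inputs the estimate comes out at roughly 28.6 minutes, in line with the "around 28.5 minutes" quoted. The tool share for this single model is about 11%, close to but not the same as the ~12% figure, which the post gives without breaking it down by model.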
