# AA-Briefcase Benchmark Released: Evaluating Models' Agentic Capabilities on Long-Horizon Knowledge Work

- Source: Artificial Analysis (@ArtificialAnlys)
- Published: 2026-06-19 07:01
- AIHOT score: 55
- AIHOT link: https://aihot.virxact.com/items/cmqk4ni0x031bslhirxq6kc8r
- Original link: https://x.com/ArtificialAnlys/status/2067744637155226101

## AI Summary

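Artificial Analysis has released AA-Briefcase, a new benchmark for evaluating models' agentic capabilities on long-horizon knowledge work projects. The benchmark comprises four private scenarios plus one public demonstration scenario, and requires models to work through fragmented context that includes 25,000+ Slack messages and 3,500+ emails. Results: Claude Fable 5 leads with an Elo of 1587, followed by Claude Opus 4.8 (max, 1356), Opus 4.7, and Zhipu's GLM-5.2 (max, 1266). On cost, Claude Fable 5 averages $31 per task, Claude Opus 4.8 $10.40, GPT-5.5 (xhigh) $3.68, GLM-5.2 (max) $2.40, and DeepSeek V4 Flash (max) only about $0.04. The top performer, Claude Fable 5, satisfies all rubric criteria on just 3% of tasks, and on 31 of 91 tasks no model scores above 50%, which shows that real-world complexity remains a challenge. The strongest price/performance options are open-weights models such as GLM-5.2 (max) and DeepSeek V4 Pro (max).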

## Full Text

Announcing AA-Briefcase, the benchmark for the next era of agentic knowledge work

AA-Briefcase is our new benchmark for testing models on long-horizon knowledge work tasks in complex projects built by industry experts. Models are evaluated on multi-week projects, each with many linked tasks and thousands of input source files.

We evaluated Claude Fable 5 from @AnthropicAI before it became unavailable, and it currently leads with an Elo score of 1587, followed by Claude Opus 4.8 (max, 1356), Opus 4.7, and the recently released GLM-5.2 (max, 1266) from @Zai_org.
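
To give the Elo gaps above some intuition, here is a minimal sketch that converts a rating difference into an expected head-to-head score. It assumes the standard logistic Elo model with a 400-point scale; the post does not state how AA-Briefcase fits its Elo ratings or which scale it uses, so the function `expected_score` and its `scale` parameter are illustrative assumptions rather than the benchmark's own method.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard logistic Elo model.

    ASSUMPTION: the 400-point logistic scale is the conventional choice;
    the post does not say which scale AA-Briefcase uses.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Elo ratings as quoted in the post.
ratings = {
    "Claude Fable 5": 1587,
    "Claude Opus 4.8 (max)": 1356,
    "GLM-5.2 (max)": 1266,
}

fable_vs_opus = expected_score(ratings["Claude Fable 5"], ratings["Claude Opus 4.8 (max)"])
opus_vs_glm = expected_score(ratings["Claude Opus 4.8 (max)"], ratings["GLM-5.2 (max)"])

print(f"Fable 5 vs Opus 4.8 (231-point gap): {fable_vs_opus:.2f}")
print(f"Opus 4.8 vs GLM-5.2 (90-point gap):  {opus_vs_glm:.2f}")
```

Under that assumption, a 231-point lead corresponds to an expected score of roughly 0.79 in a head-to-head comparison, and a 90-point lead to roughly 0.63.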

Claude Fable 5 cost $31 on average to run each AA-Briefcase task, followed by Claude Opus 4.8 at $10.40, GPT-5.5 (xhigh) at $3.68 and GLM-5.2 (max) at $2.40.

AA-Briefcase comprises four private scenarios, each representing a multi-week knowledge work project set in a realistic organizational context. A public fifth scenario (AA-Briefcase Lite) has been released via @huggingface to illustrate scenario structure, submission, and grading. It is demonstrative only and does not count toward official AA-Briefcase results.

Key elements of AA-Briefcase:

➤ Realistic long-horizon projects: AA-Briefcase moves beyond single, disconnected prompts by evaluating models across a coherent long-horizon project. Tasks build week by week, draw on shared institutional context, and require deliverables such as financial models, board presentations, and design mock-ups

➤ Large volumes of fragmented context: AA-Briefcase requires models to reason across thousands of inputs, including company documents, meeting transcripts, large-scale data exports, 25,000+ Slack messages and 3,500+ emails. These sources are fragmented, messy, and often contain realistic contradictions, testing whether models can navigate the ambiguity of real-world knowledge work

➤ Composite rubric and pairwise grading: AA-Briefcase combines binary rubric checks for ground-truth correctness with pairwise grading on analytical quality and presentation quality. Unlike many evaluations that focus on a single metric, AA-Briefcase tests agentic capabilities more comprehensively, exposing cases where models produce outputs that look polished but are incorrect or lack analytical rigor

➤ Built by industry experts: AA-Briefcase scenarios mirror real-world knowledge work, with tasks developed over months by experts across data science, product management and corporate strategy from companies including Google, McKinsey & Company and BCG. Task challenges are drawn from professional experience, making AA-Briefcase more reflective of the ambiguity, messy context and competing priorities that define real-world knowledge work
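
The two grading signals described above (binary rubric checks and pairwise judgments) can be sketched roughly as follows. This is a toy illustration only: the post does not describe the grader's implementation or how the two signals are combined, so the function names, the `(winner, loser)` judgment format, and the placeholder model names `model_a`, `model_b`, `model_c` are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple


def rubric_pass_rate(checks: Sequence[bool]) -> float:
    """Fraction of binary rubric checks that a submission passes."""
    if not checks:
        raise ValueError("a task needs at least one rubric check")
    return sum(checks) / len(checks)


def satisfies_all(checks: Sequence[bool]) -> bool:
    """True only when every rubric check for the task passes."""
    return len(checks) > 0 and all(checks)


def pairwise_win_rates(judgments: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Win rate per model from (winner, loser) pairwise judgments.

    ASSUMPTION: each judgment names a single winner; ties are not modelled.
    """
    wins: Counter = Counter()
    games: Counter = Counter()
    for winner, loser in judgments:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {model: wins[model] / games[model] for model in games}


# Hypothetical toy data, not taken from the benchmark.
toy_checks = [True, True, False, True]
toy_judgments = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_c", "model_a"),
]

pass_rate = rubric_pass_rate(toy_checks)
all_met = satisfies_all(toy_checks)
win_rates = pairwise_win_rates(toy_judgments)

print(f"rubric pass rate: {pass_rate:.2f}, all criteria met: {all_met}")
print(win_rates)
```

Keeping the rubric result separate from the pairwise result mirrors the point made in the post: a deliverable can win on presentation while still failing ground-truth checks, and a single blended number would hide that.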

Key results:

➤ Claude Fable 5 leads AA-Briefcase at 1587 Elo: Claude Opus 4.8 (max) follows at 1356, and the highest-ranked non-Anthropic model, GLM-5.2 (max), sits ~90 points behind Opus 4.8 at 1266. Note that Claude Fable 5 did not use the Opus 4.8 fallback for any task in AA-Briefcase

➤ Cost per task varies by ~800x across models tested: Claude Fable 5 leads the benchmark but costs more than $31 per task on average, compared to ~$0.04 for DeepSeek V4 Flash (max). The strongest price/performance options are open-weights models such as GLM-5.2 (max) and DeepSeek V4 Pro (max), with GLM-5.2 (max) scoring only ~90 Elo below Claude Opus 4.8 (max) for less than 25% of the cost

➤ Real-world complexity remains difficult for models: The top performer, Claude Fable 5, satisfies all rubric criteria on just 3% of AA-Briefcase tasks. On 31 of 91 tasks, no model scores above 50% on the rubric criteria

➤ Task difficulty scales with the number of required input files: For each rubric check, we identify the set of source files needed to pass. Across all models, pass rates fall as this file count increases, though top-tier models degrade less than weaker models
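
The headline ratios in the key results can be checked against the figures quoted in the post. The sketch below uses only those quoted numbers; the Claude Fable 5 cost is given as "more than $31" and the DeepSeek V4 Flash (max) cost as approximately $0.04, so the computed spread is approximate, and the variable names are illustrative.

```python
# Average cost per task in USD, as quoted in the post.
cost_per_task = {
    "Claude Fable 5": 31.00,           # quoted as "more than $31"
    "Claude Opus 4.8 (max)": 10.40,
    "GPT-5.5 (xhigh)": 3.68,
    "GLM-5.2 (max)": 2.40,
    "DeepSeek V4 Flash (max)": 0.04,   # quoted as approximately $0.04
}

# Ratio between the most and least expensive quoted models.
cost_spread = max(cost_per_task.values()) / min(cost_per_task.values())

# GLM-5.2 (max) cost as a share of Claude Opus 4.8 (max) cost.
glm_cost_share = cost_per_task["GLM-5.2 (max)"] / cost_per_task["Claude Opus 4.8 (max)"]

# Elo gap between Claude Opus 4.8 (max) and GLM-5.2 (max).
elo_gap = 1356 - 1266

# Share of tasks on which no model scores above 50% on the rubric criteria.
unsolved_share = 31 / 91

print(f"cost spread: ~{cost_spread:.0f}x")
print(f"GLM-5.2 cost vs Opus 4.8: {glm_cost_share:.0%}")
print(f"Elo gap: {elo_gap}")
print(f"tasks with no model above 50%: {unsolved_share:.0%}")
```

With the rounded figures, the spread comes out at about 775x, which is consistent with the "~800x" in the post given that the Fable 5 cost is stated as a lower bound. GLM-5.2 (max) costs about 23% of Opus 4.8 (max), and roughly a third of the 91 tasks have no model above 50%.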

More details below in thread ⬇️
