Artificial Analysis@ArtificialAnlys

2026-06-23 04:01

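AI Summary

Artificial Analysis has released AA-Briefcase, an agentic knowledge work benchmark that evaluates models on long-horizon tasks. Cost per task varies by more than 700x, and the highest performing model, Claude Fable 5, costs over $20 per task. On the cost-performance Pareto frontier, apart from the two top-scoring models from Anthropic, most of the remaining points are open weights models. Notable cost efficiency results: GLM 5.2 (max) costs $2.40 per task and scores within 90 Elo points of Claude Opus 4.8 at 65% lower cost; DeepSeek V4 Pro (max) costs $0.08 per task and scores about 60 Elo points above Gemini 3.5 Flash at over 98% lower cost.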

Open weights models make up the majority of the cost-performance Pareto frontier on AA-Briefcase, our new agentic knowledge work benchmark

Last week we released AA-Briefcase, our proprietary agentic knowledge work benchmark testing models on long-horizon tasks built by industry experts. AA-Briefcase requires models to build deliverables such as financial models, board presentations, and design mock-ups in the context of realistic multi-week projects.

The cost to run a single AA-Briefcase task varies by over 700x across the initial set of models we tested. With the highest performing model, Claude Fable 5, costing over $20 per task, cost efficiency is a key element in model selection for knowledge work.

While the two highest performing models on the cost-performance Pareto frontier are proprietary models from @AnthropicAI, most of the remaining frontier is made up of open weights models.
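For readers unfamiliar with the term, a model sits on the cost-performance Pareto frontier when no other model is both cheaper and higher scoring. A minimal Python sketch of that filter follows. The `Result` type, the `pareto_frontier` helper, and the sample numbers are all hypothetical placeholders for illustration. They are not AA-Briefcase results, and this is not necessarily how Artificial Analysis computes its frontier.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    cost_per_task: float  # USD per task
    elo: float


def pareto_frontier(results: list[Result]) -> list[Result]:
    """Return results that no cheaper-or-equal result matches or beats on Elo."""
    frontier: list[Result] = []
    best_elo = float("-inf")
    # Walk from cheapest to most expensive; on equal cost, higher Elo first.
    for r in sorted(results, key=lambda r: (r.cost_per_task, -r.elo)):
        # A result stays only if it beats every cheaper result on Elo.
        if r.elo > best_elo:
            frontier.append(r)
            best_elo = r.elo
    return frontier


# Hypothetical placeholder data, NOT real benchmark scores.
sample = [
    Result("model-a", 0.10, 1000),
    Result("model-b", 0.50, 950),   # dominated by model-a
    Result("model-c", 2.00, 1200),
    Result("model-d", 5.00, 1150),  # dominated by model-c
    Result("model-e", 20.00, 1400),
]

frontier_names = [r.name for r in pareto_frontier(sample)]
print(frontier_names)
```

With these placeholder values, the more expensive but lower-scoring entries drop out and only the models that buy extra Elo with extra cost remain.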

Notable cost efficiency trade-offs:

➤ At $2.40 per task, GLM 5.2 (max) from @Zai_org scores within 90 Elo points of Claude Opus 4.8 (max) while costing 65% less

➤ At $0.08 per task, DeepSeek V4 Pro (max) from @deepseek_ai scores ~60 Elo points above Gemini 3.5 Flash while costing over 98% less
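The percentages above imply rough per-task costs for the comparison models, which the post does not state directly. A back-of-the-envelope sketch, assuming each "X% less" figure is measured against the comparison model's per-task cost and is rounded (the helper name `implied_baseline_cost` is hypothetical):

```python
def implied_baseline_cost(cost: float, fraction_cheaper: float) -> float:
    """Cost of the comparison model, given a cost that is `fraction_cheaper` below it."""
    if not 0 <= fraction_cheaper < 1:
        raise ValueError("fraction_cheaper must be in [0, 1)")
    return cost / (1 - fraction_cheaper)


# GLM 5.2 (max): $2.40 per task, 65% less than Claude Opus 4.8 (max).
opus_estimate = implied_baseline_cost(2.40, 0.65)

# DeepSeek V4 Pro (max): $0.08 per task, over 98% less than Gemini 3.5 Flash,
# so this is a lower bound rather than a point estimate.
gemini_lower_bound = implied_baseline_cost(0.08, 0.98)

print(f"Claude Opus 4.8 (max): ~${opus_estimate:.2f} per task")
print(f"Gemini 3.5 Flash: more than ~${gemini_lower_bound:.2f} per task")
```

Under those assumptions, Claude Opus 4.8 (max) would come out near $6.86 per task and Gemini 3.5 Flash above roughly $4.00 per task. These are derived estimates, not figures reported by Artificial Analysis.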

Tags: Agents, Anthropic, DeepSeek, Reasoning
