Artificial Analysis发布AA-Briefcase智能体知识工作基准测试,评估模型在长期任务中的表现。任务成本差异超700倍,最高性能模型Claude Fable 5每任务超$20。成本-性能帕累托前沿上,除Anthropic两个最高分模型外,其余大部分由开放权重模型占据。关键性价比:GLM 5.2 (max)每任务$2.40,得分仅比Claude Opus 4.8低90 Elo,成本低65%;DeepSeek V4 Pro (max)每任务$0.08,得分比Gemini 3.5 Flash高约60 Elo,成本低98%以上。
Open weights models make up the majority of the cost-performance Pareto frontier on AA-Briefcase, our new agentic knowledge work benchmark
Last week we released AA-Briefcase, our proprietary agentic knowledge work benchmark testing models on long horizon tasks built by industry experts. AA-Briefcase requires models to build deliverables such as financial models, board presentations, and design mock-ups in the context of realistic multi week projects.
The cost to run a single AA-Briefcase task varies by over 700x in the initial set of models we tested. With the highest performing model, Claude Fable 5, costing over $20 per task, cost efficiency is a key element in model selection for knowledge work.
While the two highest performing models on the cost-performance Pareto frontier are proprietary models from @AnthropicAI, most of the remaining frontier is made up of open weights models.
Notable cost efficiency trade offs:
➤ At $2.40 per task, GLM 5.2 (max) from @Zai_org scores within 90 Elo points of Claude Opus 4.8 (max) while costing 65% less
➤ At $0.08 per task, DeepSeek V4 Pro (max) from @deepseek_ai scores ~60 Elo points above Gemini 3.5 Flash while costing over 98% less