Artificial Analysis (@ArtificialAnlys)

2026-07-02 05:09 · 1 day ago

AI summary

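Anthropic has released Claude Sonnet 5. On AA-Briefcase (an agentic knowledge work benchmark that tests models on processing thousands of files and producing spreadsheets, presentations, and UI mock-ups), Sonnet 5 (max) scores 1391 Elo, 312 points above Sonnet 4.6 (max), placing second behind only Fable 5. The gain comes from rubric scoring and analytical quality; on presentation it still trails Opus 4.8. The max setting scores highest, but the lower settings do not sit on the cost-performance Pareto frontier: Opus 4.8 (max), GLM-5.2 (max), and MiniMax-M3 offer better value than Sonnet 5 at lower effort. Sonnet 5's higher cost comes from a sharp rise in turn count: max averages 183 turns per task (more than 4x Sonnet 4.6 max) and medium averages 55 turns, with cost spanning roughly 17x across settings.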

Claude Sonnet 5 ranks second only to Fable 5 on AA-Briefcase, our new agentic knowledge work benchmark, with a ~17x cost-per-task range across its five effort settings

@AnthropicAI has released Claude Sonnet 5, the latest addition to the Claude Sonnet family. On AA-Briefcase, Claude Sonnet 5 (max) scores 1391 Elo, a +312 point improvement over Claude Sonnet 4.6 (max), making it the second-highest-scoring model, behind only Claude Fable 5. This gain is driven primarily by improvements in rubric scoring and analytical quality, while Sonnet 5 trails Claude Opus 4.8 on Presentation Elo.

We benchmarked all five available effort settings for Claude Sonnet 5:

➤ Max effort achieves the second-highest AA-Briefcase Elo, but lower efforts are not Pareto efficient: Claude Sonnet 5 (max) achieves the highest AA-Briefcase score among Sonnet 5 effort settings, but the lower effort settings do not reach the cost-performance Pareto frontier. Models such as Claude Opus 4.8 (max), GLM-5.2 (max), and MiniMax-M3 offer stronger cost-performance trade-offs than Claude Sonnet 5 at lower effort settings.

➤ Substantially higher turn use across effort levels: Claude Sonnet 5's higher cost is driven by an increased number of turns, with Sonnet 5 (max) averaging 183 turns per AA-Briefcase task, more than 4x that of Claude Sonnet 4.6 (max). This increase is consistent across effort levels, with Claude Sonnet 5 (medium) averaging 55 turns per task, in line with Claude Opus 4.8 at max effort.
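
For readers unfamiliar with the term, a configuration is on the cost-performance Pareto frontier when no other configuration is at least as cheap and at least as good, and strictly better on one of the two. The sketch below shows one common way to compute that set. The model names and the cost and score values are made-up placeholders chosen only to show the shape of the calculation; they are not AA-Briefcase results, and this is not necessarily how Artificial Analysis draws its frontier.

```python
from typing import NamedTuple


class Run(NamedTuple):
    name: str
    cost: float   # cost per task, in hypothetical units
    score: float  # benchmark score, hypothetical values


def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Return the runs that no other run beats on both cost and score."""
    frontier: list[Run] = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive (ties: higher score first).
    # A run is kept only if it scores above everything cheaper than it.
    for run in sorted(runs, key=lambda r: (r.cost, -r.score)):
        if run.score > best_score:
            frontier.append(run)
            best_score = run.score
    return frontier


# Placeholder data: "X" stands in for a model with several effort settings,
# "Y" and "Z" for competing models. None of these numbers are real.
runs = [
    Run("X (low)", 1.0, 1000),
    Run("X (medium)", 6.0, 1200),
    Run("X (max)", 17.0, 1400),
    Run("Y (max)", 4.0, 1300),
    Run("Z", 0.5, 1100),
]

frontier_names = [r.name for r in pareto_frontier(runs)]
print(frontier_names)
```

In this toy data the top setting of "X" is on the frontier because nothing else scores as high, while its low and medium settings are each beaten by a cheaper, higher-scoring competitor. That is the pattern the first bullet describes.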

AA-Briefcase is our new proprietary benchmark for agentic knowledge work. It tests models on realistic tasks across thousands of input files, requiring deliverables such as spreadsheets, presentations, and UI mock-ups. Model performance is measured across three dimensions: binary rubric checks for ground-truth correctness, pairwise grading on analytical quality, and pairwise grading on presentation quality. The AA-Briefcase Elo is a single metric that combines results across all three dimensions.
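
To give a rough feel for what a 312-point Elo gap means, the sketch below applies the conventional Elo expected-score formula (logistic curve, 400-point scale). This is an assumption: the post does not say which scale or fitting procedure AA-Briefcase uses, and the combined Elo also folds in binary rubric checks, so the result is only an indicative reading, not a reported head-to-head win rate.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the conventional Elo formula.

    Assumes the standard 400-point logistic scale; AA-Briefcase may differ.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings taken from the post: 1391 for Sonnet 5 (max), and 312 points
# lower for Sonnet 4.6 (max).
sonnet_5_max = 1391
sonnet_4_6_max = sonnet_5_max - 312

p = elo_expected_score(sonnet_5_max, sonnet_4_6_max)
print(f"Expected score for the higher-rated model: {p:.2f}")
```

Under that assumed scale, the higher-rated model would be expected to be preferred in roughly 86% of pairwise comparisons.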
