Artificial Analysis @ArtificialAnlys

2026-07-01 04:59 · 2 days ago

AI summary

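With max effort, Claude Sonnet 5 scores 53 on the Artificial Analysis Intelligence Index (#5 overall), 6 points above Sonnet 4.6 and level with GPT-5.5 (high), while trailing Opus 4.7/4.8 by about 2-3 points. At standard pricing it costs $2.29 per task, about 2x Sonnet 4.6 and about 15% more than Opus 4.8, mainly because it uses about 40% more output tokens and about 3x the agentic turns. Pricing is $3/$15 per 1M input/output tokens (promotional $2/$10 until September 1), the context window is 1M tokens, and a new xhigh effort setting has been added. On the agentic knowledge work benchmarks AA-Briefcase and GDPval-AA it matches or beats Opus 4.8, but it still trails on reasoning benchmarks. Terminal-Bench v2.1 (+9), HLE (+10), and SciCode (+7) improve markedly.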

Claude Sonnet 5 achieves 53 on the Artificial Analysis Intelligence Index, but without promotional pricing will cost more per task than Opus 4.8

We supported @AnthropicAI in evaluating Claude Sonnet 5 ahead of release: with max effort it improves 6 points over Sonnet 4.6 to achieve the same Intelligence Index as GPT-5.5 with high reasoning, but remains behind Opus 4.7 and 4.8

Key takeaways:

➤ Claude Sonnet 5 is the #5 model on the Artificial Analysis Intelligence Index, only 2-3 points behind GPT-5.5 (xhigh) and Opus 4.8 (max)

➤ With max effort, Sonnet 5 works harder than previous Anthropic models: it used ~40% more output tokens per Intelligence Index task than Sonnet 4.6, and ~3x the agentic turns for our knowledge work evaluations AA-Briefcase and GDPval-AA. This behavior scales well with the 'effort' setting, with max effort using around 6x more turns than low effort on GDPval-AA

➤ Claude Sonnet 5 costs more per task than Opus 4.8 before accounting for promotional pricing: Claude Sonnet 5 costs $2.29 per task on the Intelligence Index, a ~2x increase compared to Sonnet 4.6 and ~15% more than Claude Opus 4.8. This is driven entirely by increased token usage. Sonnet 5 retains the same $3/$15 per 1M input/output token pricing as Sonnet 4.6 (compared to $5/$25 for Opus 4.8); however, Anthropic is offering a one-third reduction to $2/$10 until September 1. Our results use standard $3/$15 pricing

➤ Sonnet 5 matches or outperforms Opus 4.8 on agentic knowledge work tasks: on both AA-Briefcase and GDPval-AA, Claude Sonnet 5 sits just ahead of Opus 4.8, trailing only Claude Fable 5 (which is not currently generally available). These benchmarks test the ability of models to produce accurate and well-presented professional outputs using our open source reference agent harness, Stirrup
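
The cost comparison in the takeaways can be checked with some rough arithmetic. The sketch below is not from Artificial Analysis: it assumes that per-task cost scales linearly with token prices, that the promotion cuts every billed component by the same one third, and that the "~15%" and "~2x" figures are exact. The derived Opus 4.8, Sonnet 4.6, and promotional numbers are therefore estimates only, and all variable and function names are illustrative.

```python
import math

# Figures quoted in the post (standard pricing).
SONNET5_COST_PER_TASK = 2.29   # USD per Intelligence Index task at $3/$15
STANDARD_PRICE = (3.0, 15.0)   # USD per 1M input / output tokens
PROMO_PRICE = (2.0, 10.0)      # promotional pricing until September 1
PREMIUM_OVER_OPUS48 = 0.15     # "~15% more than Opus 4.8" (treated as exact)
RATIO_TO_SONNET46 = 2.0        # "~2x Sonnet 4.6" (treated as exact)


def promo_scale(standard, promo):
    """Return the single factor by which the promotion scales per-task cost.

    A single factor only exists if input and output prices are cut by the
    same proportion, which holds for $3/$15 -> $2/$10.
    """
    input_ratio = promo[0] / standard[0]
    output_ratio = promo[1] / standard[1]
    if not math.isclose(input_ratio, output_ratio):
        raise ValueError("input and output prices are discounted differently")
    return input_ratio


scale = promo_scale(STANDARD_PRICE, PROMO_PRICE)

# Estimated Sonnet 5 cost per task under promotional pricing.
sonnet5_promo_cost = SONNET5_COST_PER_TASK * scale

# Costs implied by the quoted ratios (estimates, not published figures).
implied_opus48_cost = SONNET5_COST_PER_TASK / (1 + PREMIUM_OVER_OPUS48)
implied_sonnet46_cost = SONNET5_COST_PER_TASK / RATIO_TO_SONNET46

print(f"Sonnet 5, promo pricing: ${sonnet5_promo_cost:.2f} per task")
print(f"Opus 4.8, implied:       ${implied_opus48_cost:.2f} per task")
print(f"Sonnet 4.6, implied:     ${implied_sonnet46_cost:.2f} per task")
```

Under these assumptions, the promotional price would put Sonnet 5 at roughly $1.53 per task against an implied $1.99 for Opus 4.8, which is consistent with the headline claim that Sonnet 5 is only the more expensive of the two without promotional pricing.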
