Anthropic 正式发布了 Claude Opus 4.8 模型。该模型在人工智能分析公司的 GDPval-AA 基准(专注于智能体的现实工作任务)上,以“max”努力设置获得了 1890 分。这一成绩比前代 Opus 4.7 高出 137 分,并以 121 分的优势领先于次优模型 GPT-5.5 xhigh。在直接对比中,这意味着 Opus 4.8 对 GPT-5.5 xhigh 拥有约 67% 的胜率。Anthropic 在模型公开发布前,为人工智能分析公司提供了早期访问权限以进行评测。
Anthropic just launched Claude Opus 4.8, and it is the new leader on our GDPval-AA benchmark for agentic real-world work tasks
Opus 4.8 scored 1890 on GDPval-AA at launch with its 'max' effort setting, +137 points from Opus 4.7 and +121 points ahead of the next-best model, GPT-5.5 xhigh.
Compared head-to-head on the GDPval task set, this implies a ~67% win rate against GPT-5.5 xhigh.
@AnthropicAI shared access with us ahead of the public release to benchmark this model and we're glad to see our benchmarks referenced in today's launch.