Artificial Analysis@ArtificialAnlys

2026-06-12 15:02 · 20 days ago

AI Summary

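Artificial Analysis has updated its Coding Agent Index, replacing SWE-Bench Pro with Datacurve's DeepSWE benchmark. DeepSWE writes its test tasks from scratch rather than adapting them from public GitHub issues or PRs, which avoids training-data leakage; SWE-Bench Pro had a cheating problem in which models recovered the fix from the repository's commit history. Rankings shift after the swap: Codex with GPT-5.5 (xhigh) rises from 65 to 76, overtaking Claude Code with Opus 4.8 (max) at 73, and the newly released Claude Code with Fable 5 (max) goes straight to the top with 77.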

We've updated the Artificial Analysis Coding Agent Index, replacing SWE-Bench Pro with Datacurve's DeepSWE benchmark. The swap lifts Codex with GPT-5.5 (xhigh) above Claude Code with Opus 4.8 (max), while the newly released Claude Fable 5 (max) in Claude Code debuts at the top.

DeepSWE, built by @datacurve, writes its tasks from scratch rather than adapting them from public GitHub issues or pull requests, so no model has seen the solutions during training. That matters because SWE-Bench Pro, the benchmark it replaces in our Coding Agent Index, had grown gameable, with some models recovering the fix from the repository's commit history instead of solving the task.

The swap reorders the index: Codex with GPT-5.5 (xhigh) rises from 65 to 76, overtaking Claude Code with Opus 4.8 (max) at 73. Claude Code with Fable 5 (max), which enters directly on the refreshed index, leads at 77. SWE-Bench Pro had been flattering some combinations and penalizing others.

More below.

Agents Anthropic OpenAI Coding
