Artificial Analysis (@ArtificialAnlys)

2026-05-11 23:49

AI Summary

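Artificial Analysis has released its Coding Agent Index, which evaluates how combinations of models and agent harnesses perform across three major coding benchmarks. Opus 4.7 in Cursor CLI leads with a score of 61, closely followed by GPT-5.5 and Opus 4.7 in other harnesses at 60. The open-weights model GLM-5.1 scores 53 in Claude Code: competitive, but still significantly behind the top proprietary models. Economics vary widely. Cost per task ranges from $0.07 for Composer 2 to $2.26 for GLM-5.1, the latter driven by task-loop token usage of up to 4.8 million. Task duration varies by more than 7x, with Opus 4.7 needing only 6 minutes and Kimi K2.6 needing 40 minutes. Cache hit rates are generally high, which affects real-world running costs.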

Announcing the Artificial Analysis Coding Agent Index! Our new coding agent benchmarks measure how combinations of agent harnesses and models perform on 3 leading benchmarks, along with token usage, cost, and more.

When developers use AI to code, they're choosing a model but also pairing it with a specific harness. It makes sense to benchmark that combination to understand and compare performance.

The Artificial Analysis Coding Agent Index includes 3 leading benchmarks that represent a broad spectrum of coding agent use:

➤ SWE-Bench-Pro-Hard-AA: 150 realistic coding tasks that frontier models struggle with, sampled from Scale AI's SWE-Bench Pro.

➤ Terminal-Bench v2: 84 agentic terminal tasks from the Laude Institute, ranging from system administration and cryptography to machine learning. 5 tasks were filtered out due to environment incompatibility.

➤ SWE-Atlas-QnA: 124 technical questions developed by Scale AI about how code behaves, root causes of issues, and more, requiring agents to explore codebases and give text answers.

Analysis of results:

➤ Opus 4.7 and GPT-5.5 lead the Index: Opus 4.7 in Cursor CLI scores 61, followed closely by GPT-5.5 in Codex and Opus 4.7 in Claude Code at 60. GPT-5.5 in Cursor CLI follows at 58.

➤ Open-weights models are competitive, but still trail the leaders: GLM-5.1 in Claude Code is the top open-weights result at 53, followed by Kimi K2.6 and DeepSeek V4 Pro in Claude Code at 50. These are strong results, but still meaningfully behind the top proprietary models.

➤ Gemini 3.1 Pro in Gemini CLI underperforms: it scores 43, well below where Gemini 3.1 Pro sits on our Intelligence Index, highlighting that performance in Gemini CLI remains a relative weak spot for Google's offering.

➤ Cost per task (API token pricing) varies >30x: Composer 2 in Cursor CLI is cheapest at $0.07/task, followed by DeepSeek V4 Pro in Claude Code at $0.35/task and Kimi K2.6 in Claude Code at $0.76/task. At the high end, GPT-5.5 in Codex costs $2.21/task, while GLM-5.1 in Claude Code costs $2.26/task. For both, high token usage contributed to the cost; for GPT-5.5, a relatively higher per-token price contributed as well.
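The ">30x" spread can be checked directly from the per-task costs quoted above. The following is a minimal sketch in Python: the dictionary holds only the five figures listed in this post, and the variable names are illustrative rather than part of any Artificial Analysis tooling.

```python
# Cost per task in USD, copied from the figures quoted in the post
# (API token pricing). Only the five combinations named there are included.
cost_per_task = {
    "Composer 2 in Cursor CLI": 0.07,
    "DeepSeek V4 Pro in Claude Code": 0.35,
    "Kimi K2.6 in Claude Code": 0.76,
    "GPT-5.5 in Codex": 2.21,
    "GLM-5.1 in Claude Code": 2.26,
}

# Find the cheapest and most expensive combinations by cost.
cheapest = min(cost_per_task, key=cost_per_task.get)
priciest = max(cost_per_task, key=cost_per_task.get)

# Ratio of the most expensive to the cheapest cost per task.
spread = cost_per_task[priciest] / cost_per_task[cheapest]

print(f"{priciest} vs {cheapest}: {spread:.1f}x")
# prints "GLM-5.1 in Claude Code vs Composer 2 in Cursor CLI: 32.3x"
```

On the quoted numbers, $2.26 / $0.07 is roughly 32, consistent with the ">30x" claim.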
