人工智能分析发布编码代理基准指数,评估不同模型与执行框架组合在三大编码基准中的表现。Opus 4.7在Cursor CLI中以61分领先,GPT-5.5与Opus 4.7在其它框架中得分60紧随其后。开源模型GLM-5.1在Claude Code中获得53分,表现竞争但仍显著落后顶尖闭源模型。经济性差异悬殊:每任务成本从Composer 2的0.07美元到GLM-5.1的2.26美元不等,后者因任务循环令牌使用高达480万;任务耗时差异超7倍,Opus 4.7仅需6分钟而Kimi K2.6需40分钟。缓存命中率普遍较高,影响实际运行成本。
Announcing the Artificial Analysis Coding Agent Index! Our new coding agent benchmarks measure how combinations of agent harnesses and models perform on 3 leading benchmarks, token usage, cost and more
When developers use AI to code they're choosing a model, but also pairing it with a specific harness. It makes sense to benchmark that combination to understand and compare performance.
The Artificial Analysis Coding Agent Index includes 3 leading benchmarks that represent a broad spectrum of coding agent use: ➤ SWE-Bench-Pro-Hard-AA, 150 realistic coding tasks that frontier models struggle with, sampled from Scale AI's SWE-Bench Pro ➤ Terminal-Bench v2, 84 agentic terminal tasks from the Laude Institute and that range from system administration and cryptography to machine learning. 5 tasks were filtered due to environment incompatibility ➤ SWE-Atlas-QnA, 124 technical questions developed by Scale AI about how code behaves, root causes of issues, and more, requiring agents to explore codebases and give text answers