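ITBench-AA: Frontier LLMs All Score Below 50% on the First Agentic Benchmark for Enterprise IT Tasks
When it comes to IT operations, AI is still a novice. ITBench-AA holds Claude Opus 4.7 to 47%, while the open weights model GLM-5.1 reaches 40% at less than a quarter of the cost ($1.23 vs. $5.38 per task), which suggests the best value for enterprise use may not lie with closed models.
ITBench-AA SRE, a benchmark from Artificial Analysis and IBM, shows that no frontier model scores above 50%. Claude Opus 4.7 (Adaptive Reasoning, Max Effort) leads at 47%, with GPT-5.5 (xhigh) and Qwen3.7 Max at 46% and 42% respectively. The benchmark comprises 59 agentic tasks in which the model investigates a Kubernetes incident snapshot through shell commands and submits a root-cause diagnosis. A key finding is that turn counts differ by nearly 3x across models, yet longer trajectories do not translate into higher accuracy: models that over-investigate are penalized for submitting false positives. On cost, the open weights model Gemma 4 31B (Reasoning) scores 37% at $0.14 per task, beating closed models that cost more and score lower, such as Gemini 3.1 Pro Preview ($2.23 per task, 30%).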
ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM
Key findings:
- Claude Opus 4.7 (Adaptive Reasoning, Max Effort) leads at 47%, followed by GPT-5.5 (xhigh) at 46% and Qwen3.7 Max at 42%.
- All frontier models score below 50%, making ITBench-AA SRE one of the least saturated agentic benchmarks in our suite. For context, frontier models score considerably higher on Terminal-Bench.
- Turn counts vary nearly 3x and longer trajectories do not translate to higher accuracy. GPT-5.5 (xhigh) averages 31 turns per task at 46%, while Gemini 3.1 Pro Preview averages 83 turns at 30%. Models that over-investigate tend to surface upstream fault-injection mechanisms or co-occurring symptoms as false positives.
- GLM-5.1 (Reasoning) leads open weights models at 40%, effectively tied with Gemini 3.5 Flash (high). DeepSeek V4 Pro (Reasoning, Max Effort) follows at 38%, with Gemma 4 31B (Reasoning) at 37%, ahead of Gemini 3.1 Pro Preview at 30%.
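The turn-count claim can be checked with a few lines of arithmetic. This is only a sketch over the three models for which this article quotes both an average turn count and a score; the full leaderboard may contain other values, and the variable names are illustrative.

```python
# (model, average turns per task, score in %), figures quoted in this article
runs = [
    ("GPT-5.5 (xhigh)", 31, 46),
    ("Gemma 4 31B (Reasoning)", 58, 37),
    ("Gemini 3.1 Pro Preview", 83, 30),
]

# Ratio between the longest and shortest average trajectory
turns = [t for _, t, _ in runs]
spread = max(turns) / min(turns)
print(f"turn spread: {spread:.2f}x")  # turn spread: 2.68x

# Scores listed in order of increasing turn count
by_turns = sorted(runs, key=lambda r: r[1])
scores_in_turn_order = [s for _, _, s in by_turns]
print(scores_in_turn_order)  # [46, 37, 30]
```

Among these three models the score falls as the average turn count rises, which is consistent with the finding above but is not a correlation over the whole leaderboard.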
ITBench-AA SRE overview:
- 59 SRE tasks in total: 40 public tasks and 19 brand new, held-out tasks
- Each task provides a Kubernetes incident snapshot containing alerts, events, traces, metrics, logs, and application topology. The model must identify the minimal set of independent root-cause Kubernetes entities responsible for the incident.
- Faults span typical SRE failure modes including infrastructure, service, application, and chaos-injected incidents, such as resource quota exhaustion, rollout failures, connection pool exhaustion, and network partitions.
Methodology details:
- Agentic harness: each task is solved by the model running in our open-source Stirrup reference harness, with shell access to a sandboxed file system containing the relevant logs and snapshots. Each task has a 100-turn cap and is run 3 times.
- Models and agents submit a list of root-cause entities (Kubernetes Deployments, Services, Pods, etc.) they believe caused the incident. Each submission is compared against a ground-truth set of root causes provided by IBM.
- Scoring uses average precision at full recall. If a model misses any of the ground-truth root causes, it scores 0.0 for that repeat. If it identifies all of them, its score equals its precision: the share of its submitted entities that are actual root causes, i.e. true positives / (true positives + false positives). The headline score is the average across 59 tasks × 3 repeats.
- The harness (Stirrup) is held constant across all evaluated models, allowing an apples-to-apples comparison between models.
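The scoring rule described above can be sketched in a few lines of Python. This is an illustration of the stated rule, not the benchmark's actual scoring code. It makes three assumptions: the function names are made up, entities are compared as exact strings in a `namespace/Kind/name` form, and duplicate submissions are collapsed. The two extra entity names in the usage lines are hypothetical.

```python
from statistics import mean


def precision_at_full_recall(submitted, ground_truth):
    """Score one repeat: 0.0 unless every ground-truth entity was submitted,
    otherwise the precision of the submission."""
    submitted, ground_truth = set(submitted), set(ground_truth)
    if not submitted or not ground_truth <= submitted:
        return 0.0  # empty submission or a missed root cause fails the recall gate
    true_positives = len(submitted & ground_truth)
    return true_positives / len(submitted)  # TP / (TP + FP)


def headline_score(repeats):
    """Average over (submitted, ground_truth) pairs, one per task repeat."""
    return mean(precision_at_full_recall(s, g) for s, g in repeats)


truth = ["otel-demo/NetworkPolicy/frontend-block-all-ports"]
# Hypothetical extra entity: an upstream fault-injection mechanism
padded_submission = truth + ["chaos-mesh/Deployment/chaos-controller-manager"]
# Hypothetical wrong answer: a symptom rather than the root cause
wrong_submission = ["otel-demo/Service/frontend"]

exact = precision_at_full_recall(truth, truth)
padded = precision_at_full_recall(padded_submission, truth)
missed = precision_at_full_recall(wrong_submission, truth)
print(exact, padded, missed)  # 1.0 0.5 0.0
```

Adding one plausible but non-root-cause entity to an otherwise correct answer halves the score for that repeat, while missing the root cause scores zero no matter how many related entities are listed.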
Highlights
- Tasks require agents to investigate Kubernetes incident snapshots through shell commands and submit a structured JSON diagnosis identifying the responsible root-cause entities. In one public SRE task, the agent sees user-facing failures in the frontend path. It uses shell commands to inspect the offline snapshot: reviewing alerts shows the incident window, then traces/logs narrow the failure to frontend traffic. Topology pins down the affected services, and Kubernetes manifests reveal a network policy blocking the frontend. The successful diagnosis identifies the responsible root-cause entity: otel-demo/NetworkPolicy/frontend-block-all-ports.
- More turns do not mean better answers. Models that submit additional contributing entities beyond the true root cause get penalized: identifying the correct root cause but adding upstream mechanisms (e.g., a chaos-mesh controller) or co-occurring symptoms counts as a false positive under recall-gated precision. This is why some models with long trajectories underperform terser ones: Gemini 3.1 Pro Preview averages 83 turns and scores 30%, while Gemma 4 31B (Reasoning) averages 58 turns and scores 37%.
- Open weights models sit on the cost frontier of ITBench-AA SRE. Gemma 4 31B (Reasoning) scores 37% at $0.14 per task, outperforming Gemini 3.1 Pro Preview ($2.23 per task, 30%) on both score and cost. GLM-5.1 (Reasoning) scores 40% at $1.23 per task, matching Gemini 3.5 Flash (high) ($1.70) on score at lower cost. Claude Opus 4.7 (Adaptive Reasoning, Max Effort) leads the leaderboard at 47% but is the most expensive at $5.38 per task.
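To make "cost frontier" concrete, the sketch below computes which of the five models with a quoted per-task cost are not dominated on both cost and score. It covers only the figures given in this article, so models without a quoted cost are left out. Gemini 3.5 Flash (high) is entered as 40% on the assumption that "effectively tied" means an equal rounded score, and the `cost_frontier` helper is illustrative.

```python
# (model, cost per task in USD, score in %), figures quoted in this article
results = [
    ("Gemma 4 31B (Reasoning)", 0.14, 37),
    ("GLM-5.1 (Reasoning)", 1.23, 40),
    ("Gemini 3.5 Flash (high)", 1.70, 40),  # assumed 40%: "effectively tied" with GLM-5.1
    ("Gemini 3.1 Pro Preview", 2.23, 30),
    ("Claude Opus 4.7 (Adaptive Reasoning, Max Effort)", 5.38, 47),
]


def cost_frontier(rows):
    """Return the models that no other model beats or matches on both
    cost and score while being strictly better on at least one."""
    frontier = []
    for name, cost, score in rows:
        dominated = any(
            c <= cost and s >= score and (c < cost or s > score)
            for _, c, s in rows
        )
        if not dominated:
            frontier.append(name)
    return frontier


for name in cost_frontier(results):
    print(name)
```

On these five data points the frontier consists of Gemma 4 31B (Reasoning), GLM-5.1 (Reasoning), and Claude Opus 4.7. Both Gemini models are dominated by a cheaper model that scores at least as well.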
ITBench-AA is built in partnership with @IBM and is based on their ITBench benchmark.
For more information see:
- ITBench paper on arXiv: https://arxiv.org/abs/2502.05352
- GitHub: https://github.com/itbench-hub/ITBench
- ITBench-AA leaderboard: https://artificialanalysis.ai/evaluations/itbench-aa
- ITBench-AA HuggingFace repo: https://huggingface.co/datasets/ArtificialAnalysis/ITBench-AA/tree/main/sre
Community
Why Qwen 3.5 and not 3.6?
does this mean many are not ready for agentic ai?