# Artificial Analysis and IBM Jointly Launch the First Enterprise IT Benchmark for AI Agents

- Source: Artificial Analysis (@ArtificialAnlys)
- Published: 2026-05-28 02:08
- AIHOT score: 71
- AIHOT link: https://aihot.virxact.com/items/cmpoeqjkx05idslv49g8l2ofq
- Original link: https://x.com/ArtificialAnlys/status/2059698327235805258

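## AI Summary

Artificial Analysis and IBM Research have jointly launched ITBench-AA, the first benchmark for evaluating AI agents on enterprise IT tasks, starting with Site Reliability Engineering (SRE). The benchmark comprises 59 Kubernetes incident response tasks, and no frontier model scores above 50%. Claude Opus 4.7 leads at 47%, followed by GPT-5.5 at 46% and Qwen3.7 Max at 42%. Among open-weights models, Zhipu's GLM-5.1 (Reasoning) scores 40%, level with Gemini 3.5 Flash, and DeepSeek V4 Pro scores 38%. The analysis also finds that turn counts vary nearly 3x across models, but longer trajectories do not guarantee higher accuracy.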

## Full Text

Artificial Analysis and IBM Research are launching ITBench-AA, the first in a new series of benchmarks evaluating models on agentic enterprise IT tasks, starting with Site Reliability Engineering tasks where frontier models score below 50%.

ITBench-AA's SRE tasks benchmark model performance on Kubernetes incident response, where models must diagnose live systems by reading logs, tracing dependencies, and identifying root-cause entities across complex infrastructure. The underlying ITBench dataset has been developed by @IBM's Software Innovation Lab, leveraging IBM's deep expertise in enterprise IT operations.

Artificial Analysis has worked closely with IBM over the last 6 months to develop an implementation of the dataset for frontier AI evaluation, beginning with Site Reliability Engineering (SRE) and expanding to Financial Operations (FinOps) and Chief Information Security Officer (CISO) tasks over time.

ITBench-AA SRE overview:
➤ 59 SRE tasks in total: 40 public tasks and 19 brand-new, held-out tasks
➤ Each task provides a Kubernetes incident snapshot containing alerts, events, traces, metrics, logs, and application topology. The model must identify the minimal set of independent root-cause Kubernetes entities responsible for the incident
➤ Faults span typical SRE failure modes, including infrastructure, service, application, and chaos-injected incidents, such as resource quota exhaustion, rollout failures, connection pool exhaustion, and network partitions

Methodology details:
➤ Agentic harness: each task is solved by the model running in our open-source Stirrup reference harness, with shell access to a sandboxed file system containing the relevant logs and snapshots. Each task has a 100-turn cap and is run 3 times
➤ Models submit a list of root-cause entities (Kubernetes Deployments, Services, Pods, etc.) they believe caused the incident. Each submission is compared against a ground-truth set of root causes provided by IBM Research
➤ Scoring uses average precision at full recall: if a model misses any of the ground-truth root causes, it scores 0.0 for that repeat. If it identifies all of them, its score equals its precision, that is, the share of its submitted entities that are actual root causes: true positives / (true positives + false positives). The headline score is the average across 59 tasks × 3 repeats
➤ The harness (Stirrup) is held constant across all evaluated models, allowing an apples-to-apples comparison between models
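
The scoring rule described above can be sketched in Python as follows. This is an illustrative reading of the description, not the benchmark's actual scorer: the function names and the entity strings are hypothetical, and the choices to deduplicate submissions and to score an empty submission as 0.0 are assumptions.

```python
from statistics import mean


def precision_at_full_recall(submitted, ground_truth):
    """Score one repeat: 0.0 unless every ground-truth entity was submitted."""
    submitted = set(submitted)        # assumption: duplicate entries count once
    ground_truth = set(ground_truth)
    if not submitted:                 # assumption: an empty submission scores 0.0
        return 0.0
    if not ground_truth <= submitted:
        return 0.0                    # at least one root cause was missed
    # With full recall, true positives == len(ground_truth), so this ratio is
    # true positives / (true positives + false positives).
    return len(ground_truth) / len(submitted)


def headline_score(runs):
    """Average over (submitted, ground_truth) pairs, one pair per task repeat."""
    return mean(precision_at_full_recall(s, g) for s, g in runs)


# Hypothetical entity names, for illustration only.
runs = [
    (["deployment/checkout", "service/cart"], ["deployment/checkout"]),  # 0.5
    (["deployment/checkout"], ["deployment/checkout", "pod/db-0"]),      # 0.0
    (["pod/db-0"], ["pod/db-0"]),                                        # 1.0
]
print(headline_score(runs))  # 0.5
```

Under this rule one extra entity next to a single true root cause halves the score for that repeat, while a missed root cause zeroes it, so both over-reporting and under-reporting are penalized.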

Key findings:
➤ Claude Opus 4.7 (Adaptive Reasoning, Max Effort) leads at 47%, followed by GPT-5.5 (xhigh) at 46% and Qwen3.7 Max at 42%
➤ All frontier models score below 50%, making ITBench-AA SRE one of the least saturated agentic benchmarks in our suite. For context, frontier models score considerably higher on Terminal-Bench
➤ Turn counts vary nearly 3x, and longer trajectories do not translate to higher accuracy. GPT-5.5 (xhigh) averages 31 turns per task at 46%, while Gemini 3.1 Pro Preview averages 83 turns at 30%. Models that over-investigate tend to surface upstream fault-injection mechanisms or co-occurring symptoms as false positives
➤ GLM-5.1 (Reasoning) leads open-weights models at 40%, effectively tied with Gemini 3.5 Flash (high). DeepSeek V4 Pro (Reasoning, Max Effort) follows at 38%, with Gemma 4 31B (Reasoning) at 37%, ahead of Gemini 3.1 Pro Preview at 30%
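
As a quick check on the "nearly 3x" figure, the two turn counts quoted in the findings can be compared directly. This sketch assumes those two models are the extremes the ratio refers to, which the post does not state explicitly; the `turn_ratio` helper and the dictionary layout are illustrative only.

```python
# Average turns per task and headline score, as quoted in the findings.
reported = {
    "GPT-5.5 (xhigh)": {"avg_turns": 31, "score": 0.46},
    "Gemini 3.1 Pro Preview": {"avg_turns": 83, "score": 0.30},
}


def turn_ratio(a, b):
    """Ratio of the larger average turn count to the smaller one."""
    return max(a, b) / min(a, b)


ratio = turn_ratio(
    reported["GPT-5.5 (xhigh)"]["avg_turns"],
    reported["Gemini 3.1 Pro Preview"]["avg_turns"],
)
print(f"{ratio:.2f}x")  # 2.68x
```

The model using roughly 2.7 times as many turns scores 16 percentage points lower, which matches the observation that longer trajectories do not translate to higher accuracy.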
