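Artificial Analysis and IBM Research have jointly launched ITBench-AA, the first benchmark evaluating AI agents on enterprise IT tasks, with Site Reliability Engineering (SRE) as the initial task set. The benchmark comprises 59 Kubernetes incident response tasks, and no frontier model scores above 50%. Claude Opus 4.7 leads at 47%, followed by GPT-5.5 at 46% and Qwen3.7 Max at 42%. Among open-source models, Zhipu's GLM-5.1 (Reasoning) scores 40%, tied with Gemini 3.5 Flash, and DeepSeek V4 Pro scores 38%. The analysis also found that the number of reasoning turns differs by nearly 3x across models, but more turns do not guarantee higher accuracy.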
Artificial Analysis and IBM Research are launching ITBench-AA, the first in a new series of benchmarks evaluating models on agentic enterprise IT tasks, starting with Site Reliability Engineering tasks, where frontier models score below 50%.
ITBench-AA's SRE tasks benchmark model performance on Kubernetes incident response, where models must diagnose live systems by reading logs, tracing dependencies, and identifying root-cause entities across complex infrastructure. The underlying ITBench dataset was developed by @IBM's Software Innovation Lab, drawing on IBM's expertise in enterprise IT operations.