Artificial Analysis (@ArtificialAnlys)

2026-05-28 02:08 · 36 days ago

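AI summary

Artificial Analysis and IBM Research have jointly launched ITBench-AA, the first benchmark evaluating AI agents on enterprise IT tasks, with Site Reliability Engineering (SRE) as the initial task set. The benchmark contains 59 Kubernetes incident response tasks, and no frontier model scores above 50%. Claude Opus 4.7 leads with 47%, GPT-5.5 scores 46%, and Qwen3.7 Max scores 42%. Among open-source models, Zhipu's GLM-5.1 (Reasoning) scores 40%, level with Gemini 3.5 Flash, and DeepSeek V4 Pro scores 38%. The analysis also finds that the number of turns models take differs by nearly 3x, but more turns do not guarantee higher accuracy.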

Artificial Analysis and IBM Research are launching ITBench-AA, the first in a new series of benchmarks evaluating models on agentic enterprise IT tasks, starting with Site Reliability Engineering tasks, where frontier models score below 50%

ITBench-AA's SRE tasks benchmark model performance on Kubernetes incident response, where models must diagnose live systems by reading logs, tracing dependencies, and identifying root-cause entities across complex infrastructure. The underlying ITBench dataset has been developed by @IBM's Software Innovation Lab, leveraging IBM's deep expertise in enterprise IT operations

Artificial Analysis has worked closely with IBM over the last 6 months to develop an implementation of the dataset for frontier AI evaluation, beginning with Site Reliability Engineering (SRE) and expanding to Financial Operations (FinOps) and Chief Information Security Officer (CISO) tasks over time

ITBench-AA SRE overview:
➤ 59 SRE tasks in total: 40 public tasks and 19 brand-new, held-out tasks
➤ Each task provides a Kubernetes incident snapshot containing alerts, events, traces, metrics, logs, and application topology. The model must identify the minimal set of independent root-cause Kubernetes entities responsible for the incident
➤ Faults span typical SRE failure modes including infrastructure, service, application, and chaos-injected incidents, such as resource quota exhaustion, rollout failures, connection pool exhaustion, and network partitions

Methodology details:
➤ Agentic harness: each task is solved by the model running in our open-source Stirrup reference harness, with shell access to a sandboxed file system containing the relevant logs and snapshots. 100-turn cap per task, 3 repeats per task
➤ Models submit a list of root-cause entities (Kubernetes Deployments, Services, Pods, etc.) they believe caused the incident. Each submission is compared against a ground-truth set of root causes provided by IBM Research
➤ Scoring uses average precision at full recall: if a model misses any of the ground-truth root causes, it scores 0.0 for that repeat. If it identifies all of them, it is awarded a score equal to its precision, that is, the share of its submitted entities that are actual root causes: true positives / (true positives + false positives). The headline score is the average across 59 tasks × 3 repeats
➤ The harness (Stirrup) is held constant across all evaluated models, allowing an apples-to-apples comparison between models
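The scoring rule described above can be illustrated with a short Python sketch. This is not the ITBench-AA or Stirrup scoring code. The function names, the entity strings, and the choice to score an empty submission as 0.0 are assumptions made here for illustration only.

```python
from statistics import mean


def precision_at_full_recall(submitted, ground_truth):
    """Score one repeat of one task.

    Returns 0.0 if any ground-truth root cause is missing from the
    submission; otherwise returns precision = TP / (TP + FP).
    """
    submitted_set = set(submitted)
    truth_set = set(ground_truth)
    # Assumption: an empty submission scores 0.0 (avoids dividing by zero).
    if not submitted_set:
        return 0.0
    # Full recall is required: every ground-truth entity must be submitted.
    if not truth_set <= submitted_set:
        return 0.0
    true_positives = len(submitted_set & truth_set)
    return true_positives / len(submitted_set)


def headline_score(repeat_scores):
    """Average the per-repeat scores over all tasks and repeats."""
    return mean(repeat_scores)


# Hypothetical entity names, not taken from the benchmark.
truth = ["deployment/checkout", "service/payments"]
repeats = [
    ["deployment/checkout", "service/payments"],                # exact match
    ["deployment/checkout", "service/payments", "pod/cart-0"],  # one false positive
    ["deployment/checkout"],                                    # misses a root cause
]
scores = [precision_at_full_recall(r, truth) for r in repeats]
print(scores, headline_score(scores))
```

In this toy case the three repeats score 1.0, 2/3, and 0.0. A submission that pads its list with extra entities loses precision, while one that omits a single root cause gets nothing for that repeat.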
