# ITBench-AA Benchmark Released: Evaluating AI Agents on Enterprise IT Operations Tasks

- Source: Qwen (@Alibaba_Qwen)
- Published: 2026-05-28 14:55
- AIHOT score: 60
- AIHOT link: https://aihot.virxact.com/items/cmpp5fvtl0c15slv4zyftvpsj
- Original link: https://x.com/Alibaba_Qwen/status/2059891171405787169

## AI Summary

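Artificial Analysis and IBM Research have jointly launched ITBench-AA, the first in a new series of benchmarks that evaluate AI agents on enterprise IT operations tasks. The first release focuses on site reliability engineering (SRE) and comprises 59 Kubernetes incident-response tasks. Within a limited number of turns, a model must diagnose the root-cause entity behind each incident by analyzing logs, tracing dependencies, and similar investigation steps. The benchmark runs on the Stirrup framework and scores models by "average precision at full recall". Key findings: Claude Opus 4.7 leads with 47%, GPT-5.5 scores 46%, and Qwen3.7 Max (Tongyi Qianwen) ranks third with 42%. Every frontier model scores below 50%, which suggests the benchmark is highly challenging. Among open-source models, GLM-5.1 (reasoning) leads with 40%.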
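The summary names the metric but does not define it. As a rough illustration only, the sketch below assumes one plausible reading: a task earns credit only when the predicted entity set contains every ground-truth root-cause entity (full recall), the credit equals the precision of that prediction, and the benchmark score is the mean over tasks. The function names and the example entity names are hypothetical, and the actual ITBench-AA scoring may differ.

```python
from typing import Iterable, List, Tuple


def precision_at_full_recall(predicted: Iterable[str], truth: Iterable[str]) -> float:
    """Score one task under the ASSUMED reading of the metric.

    Returns 0.0 unless every ground-truth entity is predicted;
    otherwise returns |truth| / |predicted| (the precision).
    """
    predicted_set, truth_set = set(predicted), set(truth)
    if not truth_set:
        raise ValueError("each task needs at least one ground-truth entity")
    if not truth_set <= predicted_set:
        return 0.0  # recall below 100%: no credit under this reading
    return len(truth_set) / len(predicted_set)


def mean_score(tasks: List[Tuple[Iterable[str], Iterable[str]]]) -> float:
    """Average the per-task scores over (predicted, truth) pairs."""
    if not tasks:
        raise ValueError("no tasks to score")
    return sum(precision_at_full_recall(p, t) for p, t in tasks) / len(tasks)


if __name__ == "__main__":
    # Hypothetical entity names, not taken from the benchmark.
    tasks = [
        (["pod/checkout"], ["pod/checkout"]),                # exact hit
        (["pod/checkout", "svc/cart"], ["pod/checkout"]),    # one extra guess
        (["svc/cart"], ["pod/checkout"]),                    # missed root cause
    ]
    print(f"mean score: {mean_score(tasks):.2f}")
```

Under this reading, listing extra suspects lowers the score, and missing any true root cause zeroes the task, which would be consistent with frontier models staying below 50%.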

## Body

📢Qwen3.7-Max just hit #3 on ITBench-AA - a fresh benchmark testing how well models handle real-world enterprise IT tasks, agentic-style.

🔧Agentic era, go with Qwen.🏃🏃

### Quoted Tweet

> Artificial Analysis: Artificial Analysis and IBM Research are launching ITBench-AA, the first in a new series of benchmarks evaluating models on agentic enterprise IT tasks, startin...