Qwen @Alibaba_Qwen · X

2026-05-28 14:55 · 35 days ago

AI summary

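Artificial Analysis and IBM Research have jointly launched ITBench-AA, the first benchmark to evaluate AI agents on enterprise IT operations tasks. The first round of tests focuses on site reliability engineering (SRE) and comprises 59 Kubernetes incident-response tasks. Within a limited number of turns, a model must diagnose the root-cause entity behind each incident by analyzing logs, tracing dependencies, and similar means. The benchmark uses the Stirrup framework and scores models by "average precision at full recall". Key findings: Claude Opus 4.7 leads with a score of 47%, GPT-5.5 scores 46%, and Tongyi Qianwen's Qwen3.7 Max ranks third with 42%. Every frontier model scores below 50%, which indicates that the benchmark is highly challenging. Among open-source models, GLM-5.1 (reasoning) leads with 40%.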
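The summary names the metric but does not define it. As a rough illustration only, here is one plausible reading of "average precision at full recall": a task earns the precision of its predicted root-cause set when every ground-truth entity is included, and zero otherwise, and the benchmark score is the mean over tasks. That definition, the function names, and the example entity names are all assumptions made for this sketch, not ITBench-AA's published scoring code.

```python
from typing import Iterable, Sequence, Tuple


def precision_at_full_recall(predicted: Iterable[str], truth: Iterable[str]) -> float:
    """Score one task under the ASSUMED definition described above.

    Returns precision if every ground-truth entity was predicted, else 0.0.
    """
    predicted_set, truth_set = set(predicted), set(truth)
    # No credit unless all ground-truth root causes are recalled.
    if not truth_set or not truth_set <= predicted_set:
        return 0.0
    # Full recall reached: penalize extra (wrong) guesses via precision.
    return len(truth_set) / len(predicted_set)


def mean_score(tasks: Sequence[Tuple[Iterable[str], Iterable[str]]]) -> float:
    """Average the per-task scores over (predicted, truth) pairs."""
    scores = [precision_at_full_recall(p, t) for p, t in tasks]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical tasks; the entity names are invented for illustration.
tasks = [
    (["checkout-service"], ["checkout-service"]),   # exact match: 1.0
    (["cart-db", "cart-service"], ["cart-db"]),     # one extra guess: 0.5
    (["frontend"], ["payment-service"]),            # root cause missed: 0.0
]
print(mean_score(tasks))  # 0.5
```

Under this reading, a model cannot raise its score by listing many candidate entities, because each extra guess lowers precision, while missing any true root cause zeroes the task.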

📢 Qwen3.7-Max just hit #3 on ITBench-AA - a fresh benchmark testing how well models handle real-world enterprise IT tasks, agentic-style.

🔧 Agentic era, go with Qwen. 🏃🏃

Artificial Analysis: Artificial Analysis and IBM Research are launching ITBench-AA, the first in a new series of benchmarks evaluating models on agentic enterprise IT tasks, startin...

Agents · Evaluation/Benchmarks
