Alibaba Cloud (@alibaba_cloud) · X

2026-05-28 15:11

AI Summary

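ITBench-AA, launched jointly by Artificial Analysis and IBM Research, is the first benchmark to evaluate how well models handle real enterprise IT tasks, and it focuses on site reliability engineering (SRE) tasks. In the results, Qwen (Qwen3.7-Max) ranks third with a score of 42%. Every frontier model scored below 50% on the test: Claude Opus 4.7 leads at 47%, followed closely by GPT-5.5 (xhigh) at 46%. Among open-source models, GLM-5.1 (Reasoning) leads at 40%. The benchmark is set to expand to other tasks, such as financial operations (FinOps).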

📢 Qwen3.7-Max just hit #3 on ITBench-AA - a fresh benchmark testing how well models handle real-world enterprise IT tasks, agentic-style.

🔧 Agentic era, go with Qwen. 🏃🏃

Artificial Analysis: Artificial Analysis and IBM Research are launching ITBench-AA, the first in a new series of benchmarks evaluating models on agentic enterprise IT tasks, startin...

Tags: Agents · Evaluation/Benchmarks · Deployment/Engineering
