Alibaba Cloud (@alibaba_cloud)

2026-05-28 16:08

AI Summary

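The Tongyi Qianwen (Qwen) team announced that its Qwen3.7-Max model ranks third on the new ITBench-AA benchmark. The benchmark, launched by Artificial Analysis in collaboration with IBM Research, evaluates how well models solve real-world enterprise IT tasks and currently focuses on site reliability engineering (SRE). It comprises 59 Kubernetes troubleshooting tasks. Claude Opus 4.7 ranks first with a score of 47%, GPT-5.5 (xhigh) follows closely at 46%, and Qwen3.7-Max is third at 42%. All frontier models scored below 50%, which indicates that the benchmark is highly challenging.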

📢 Qwen3.7-Max just hit #3 on ITBench-AA - a fresh benchmark testing how well models handle real-world enterprise IT tasks, agentic-style.

🔧 Agentic era, go with Qwen. 🏃🏃

API: https://int.alibabacloud.com/m/1000413314/

Artificial Analysis: Artificial Analysis and IBM Research are launching ITBench-AA, the first in a new series of benchmarks evaluating models on agentic enterprise IT tasks, startin...

Agents · Reasoning · Evaluation/Benchmarks
