Artificial Analysis (@ArtificialAnlys)

2026-07-01 02:20

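AI summary

GLM-5.2 scores 51 on the Artificial Analysis Intelligence Index, making it the most intelligent open weights model, but it used 141M output tokens (95% reasoning) to do so, 1.8x the average model. By comparison, Claude Opus 4.8 used 117M output tokens to score 56, and GPT-5.5 used 72M to score 55. Nearly two-thirds of GLM-5.2's tokens (88M) went to Humanity's Last Exam, 3.2x GPT-5.5's usage, yet it scored only 40% (Opus 46%, GPT-5.5 44%). On AA-Omniscience, which measures hallucination rates, GLM-5.2 scored just 4, far below Opus 4.8 (27), GPT-5.5 (20), and Gemini 3.5 Flash (23). On the agentic benchmark GDPval-AA v2, GLM-5.2 is the top open weights model and #3 overall, ahead of GPT-5.5. Among other open models, DeepSeek V4 Pro scores 44, 7 points behind.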

GLM-5.2 is the most intelligent open weights model available, but also the most verbose among the leading models

GLM-5.2 (max) used ~141M output tokens (95% reasoning) to run the Artificial Analysis Intelligence Index (1.8x the average model).

Key takeaways:

➤ GLM-5.2 generates more tokens (141M) to run the Artificial Analysis Intelligence Index than Claude Opus 4.8 (117M) and nearly double GPT-5.5 (72M), while scoring below both (51 vs 56 and 55)

➤ Almost two-thirds of that goes to a single benchmark, Humanity's Last Exam: ~88M tokens, 3.2x GPT-5.5's, and it still scores lowest of the three (40% vs Opus 46% and GPT-5.5 44%)

➤ The verbosity is not focused on recalling facts. On AA-Omniscience, which measures hallucination rates, GLM-5.2 thinks less than GPT-5.5 yet scores just 4, far below Opus 4.8 (27), GPT-5.5 (20), and Gemini 3.5 Flash (23)

➤ Additional thinking pays off most on agentic real-world work: on GDPval-AA v2 GLM-5.2 is the top open weights model and #3 overall, beating GPT-5.5

➤ Several open models generate even more output, but all score lower on intelligence; the strongest of them, DeepSeek V4 Pro, trails GLM-5.2 by 7 points (44 vs 51)
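The ratios quoted in the takeaways can be sanity-checked with a few lines of arithmetic. The sketch below is a back-of-the-envelope check, not Artificial Analysis's methodology: it uses only the rounded figures from the post, and the names `MODELS`, `tokens_per_point`, and `derived_figures` are hypothetical helpers introduced for illustration. The "implied" values (the average model's token count and GPT-5.5's Humanity's Last Exam tokens) are back-calculated from the stated multiples and are not reported in the post.

```python
# Rounded figures quoted in the post: output tokens (millions) used to run the
# Artificial Analysis Intelligence Index, and the resulting index score.
MODELS = {
    "GLM-5.2 (max)": {"tokens_m": 141, "score": 51},
    "Claude Opus 4.8": {"tokens_m": 117, "score": 56},
    "GPT-5.5": {"tokens_m": 72, "score": 55},
}

GLM_HLE_TOKENS_M = 88        # GLM-5.2 tokens spent on Humanity's Last Exam
GLM_VS_AVERAGE = 1.8         # stated multiple of the average model
GLM_VS_GPT_HLE = 3.2         # stated multiple of GPT-5.5's HLE tokens


def tokens_per_point(tokens_m: float, score: float) -> float:
    """Millions of output tokens spent per index point (illustrative metric)."""
    return tokens_m / score


def derived_figures() -> dict:
    """Recompute the ratios in the post and back out the implied values."""
    glm = MODELS["GLM-5.2 (max)"]
    gpt = MODELS["GPT-5.5"]
    return {
        # "nearly double GPT-5.5"
        "glm_vs_gpt_tokens": glm["tokens_m"] / gpt["tokens_m"],
        # "almost two-thirds of that goes to a single benchmark"
        "hle_share_of_glm_tokens": GLM_HLE_TOKENS_M / glm["tokens_m"],
        # Assumption: implied by "1.8x the average model", not stated directly.
        "implied_average_tokens_m": glm["tokens_m"] / GLM_VS_AVERAGE,
        # Assumption: implied by "3.2x GPT-5.5's", not stated directly.
        "implied_gpt_hle_tokens_m": GLM_HLE_TOKENS_M / GLM_VS_GPT_HLE,
    }


if __name__ == "__main__":
    for name, value in derived_figures().items():
        print(f"{name}: {value:.2f}")
    for model, row in MODELS.items():
        ratio = tokens_per_point(row["tokens_m"], row["score"])
        print(f"{model}: {ratio:.2f}M tokens per index point")
```

Because the inputs are rounded, the outputs are approximate. Tokens per index point is only a rough verbosity measure, since index scores are not linear in effort.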

