Ethan Mollick @emollick

2026-06-17 06:21 · 16 days ago

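AI summary

The new GDPval-AA v2 is now the highest-weighted evaluation in Intelligence Index v4.1. The upgrade re-baselines ELO so that human performance sits at 1000, introduces a rotating panel of frontier-model judges, and raises the turn cap from 100 to 250. Claude Fable 5 (with fallback) leads at 1818 but is currently unavailable; Claude Opus 4.8 scores 1638 and GPT-5.5 (xhigh) scores 1531. Ethan Mollick's criticism: having AIs evaluate the work of other AIs on publicly available questions drawn from a different closed benchmark says little, and how the human ELO is set is opaque. He considers it a poor benchmark both before and after the update.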

This was not a good benchmark before it was updated and it is not a good benchmark now. Having AIs evaluate the work of other AIs on publicly available questions from a different closed benchmark doesn't tell you very much.

And it is unclear how they establish the human ELO.

Artificial Analysis: GDPval-AA v2 is the highest weighted evaluation in the Intelligence Index v4.1. The upgrade re-baselines ELO to human performance at 1000, introduces a rotating...

