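The new GDPval-AA v2 is now the highest-weighted evaluation in Intelligence Index v4.1. The upgrade resets the ELO baseline so that humans score 1000, introduces a rotating panel of frontier-model judges, and raises the turn cap from 100 to 250. Claude Fable 5 (with fallback) leads with 1818 points but is currently unavailable; Claude Opus 4.8 scores 1638 and GPT-5.5 (xhigh) scores 1531. Ethan Mollick's criticism: having AIs evaluate other AIs on publicly available questions drawn from a different closed benchmark tells you little, and how the human ELO is set is opaque. He considers it a poor benchmark both before and after the update.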
This was not a good benchmark before it was updated and it is not a good benchmark now. Having AIs evaluate the work of other AIs on publicly available questions from a different closed benchmark doesn't tell you very much.
And it is unclear how they establish the human ELO.