Epoch AI@EpochAIResearch

2025-10-11 00:26

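AI Summary

On the extremely hard FrontierMath Tier 4 math benchmark, GPT-5 Pro set a new record with 13% accuracy, narrowly beating Gemini 2.5 Deep Think by a single problem (the difference is not statistically significant), while Grok 4 Heavy trailed clearly.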

We manually evaluated three compute-intensive model settings on our extremely hard math benchmark. FrontierMath Tier 4: Battle Royale!

GPT-5 Pro set a new record (13%), edging out Gemini 2.5 Deep Think by a single problem (not statistically significant). Grok 4 Heavy lags. 🧵
