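On the extremely hard FrontierMath Tier 4 math benchmark, GPT-5 Pro set a new record at 13% accuracy, edging out Gemini 2.5 Deep Think by a single problem (not a statistically significant difference), while Grok 4 Heavy clearly lagged behind.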
We manually evaluated three compute-intensive model settings on our extremely hard math benchmark. FrontierMath Tier 4: Battle Royale!
GPT-5 Pro set a new record (13%), edging out Gemini 2.5 Deep Think by a single problem (not statistically significant). Grok 4 Heavy lags. 🧵