All benchmarks are flawed, but GPQA has been fairly consistent and highly correlated with other benchmarks. I think it's a good way to see how far we've come: the free model from OpenAI, GPT 5.5 Instant, is at a level that even paid models did not reach until late 2025.