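AI Summary
Turning experimental research into a product takes months, but AI capabilities advance so quickly that a few months can open up a generational gap. On the new IMO problems, every model underperformed humans, and Grok-4 did poorly even with a best-of-n strategy.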
It takes us a few months to turn the experimental research frontier into a product. But progress is so fast that a few months can mean a big difference in capabilities.
So, all the models underperform humans on the new International Mathematical Olympiad questions, and Grok-4 is especially bad on it, even with best-of-n selecti...