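Stanford, MIT, NVIDIA, Google, and other top labs have jointly proposed a new benchmark, AutoLab, comprising 36 tasks. In each task, the agent starts from working but weak code and must iteratively improve it within a fixed time. The tasks cover system speedups, puzzles, model development, and CUDA kernels. Results from 17 frontier models show that the key to success is not how good the first solution is, but whether the agent keeps testing, experiments frequently, and uses empirical feedback. Claude Opus 4.6 leads the benchmark through persistent iteration rather than initial judgment, while other frontier models either give up early or think too long and run out of time.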
Strong AI agents still struggle with long research work because they often fail to keep testing and improving.
A new paper from Stanford, MIT, NVIDIA, Google, and other top labs shows that today's strongest research agents win less by brilliance than by refusing to stop testing.
The paper proposes AutoLab, a benchmark with 36 tasks where each agent starts from working but weak code and must make it better within a fixed time limit.
The tasks cover system speedups, puzzles, model development, and CUDA kernel work, so the test is not just about writing code once but about managing a long work session.
The authors tested 17 strong models and found that the best results did not mainly come from the first idea being good, but from the model staying active, testing often, and using feedback well.
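To make the "start weak, keep testing within a time limit" setup concrete, here is a minimal sketch of a time-budgeted improvement loop. It is not AutoLab's harness or any model's actual strategy: `score`, `propose`, and `improve` are hypothetical names, and the toy objective merely stands in for a real task metric such as speedup or accuracy.

```python
import random
import time


def score(candidate):
    """Hypothetical evaluator: higher is better (stand-in for a task metric)."""
    return -sum((x - 3.0) ** 2 for x in candidate)


def propose(candidate, rng):
    """Hypothetical edit step: perturb one parameter of the current best."""
    new = list(candidate)
    i = rng.randrange(len(new))
    new[i] += rng.uniform(-1.0, 1.0)
    return new


def improve(baseline, budget_seconds, max_trials=None, seed=0):
    """Keep proposing and testing until the time budget or trial cap is hit."""
    rng = random.Random(seed)
    deadline = time.monotonic() + budget_seconds
    best, best_score = list(baseline), score(baseline)
    trials = 0
    while time.monotonic() < deadline and (max_trials is None or trials < max_trials):
        candidate = propose(best, rng)
        candidate_score = score(candidate)  # empirical feedback on every trial
        trials += 1
        if candidate_score > best_score:  # keep only measured improvements
            best, best_score = candidate, candidate_score
    return best, best_score, trials


baseline = [0.0, 0.0, 0.0]  # the "working but weak" starting point
few = improve(baseline, budget_seconds=5.0, max_trials=5)
many = improve(baseline, budget_seconds=5.0, max_trials=500)
print("score after 5 trials:", few[1])
print("score after 500 trials:", many[1])
```

Because both runs share a seed and only accept measured improvements, the longer run can never end with a worse score than the shorter one. That mirrors the reported finding in miniature: an agent that stops early forfeits whatever later trials would have found.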