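IntologyAI's NanoGPT-Bench evaluation shows that coding agents such as Codex, Claude Code, and Autoresearch recover only about 9.3% of human progress on AI R&D tasks. The agents spend most of their compute on hyperparameter tuning and invest very little in core algorithmic research. Claude Code and Autoresearch touch on algorithmic research somewhat in their reasoning, but still fall short in actual code implementation. The evaluation is based on the NanoGPT Speedrun competition, uses a standardized five-month world-record window, and runs fully autonomously end to end, in order to control for model dependence and data contamination. The results indicate that current coding agents remain quite limited in their ability to carry out real AI R&D autonomously.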
Very interesting results from this NanoGPT-Bench eval.
There is so much talk about self-improving agents.
But can coding agents do real AI R&D?
@IntologyAI reports that Codex, Claude Code, and Autoresearch recover only 9.3% of human progress.
Coding agents spend most of their compute on hyperparameter tuning.
In fact, coding agents rarely attempt algorithmic research at all.
Claude Code and Autoresearch both reason more about algorithmic research, but still dodge implementation.