The creators of SWE-Bench just dropped a really simple new benchmark that every LLM scores 0% on.
ProgramBench asks: can models recreate real executable programs (ffmpeg, SQLite, ripgrep) from scratch with no internet?
We are still far from saturating model quality.