Deedy @deedydas · X

2026-05-05 23:23 · 58 days ago

The creators of SWE-Bench just dropped a really simple new benchmark every LLM gets 0% on.

ProgramBench asks: can models recreate real executable programs (ffmpeg, SQLite, ripgrep) from scratch with no internet?

We are far from saturated on model quality.
