Rohan Paul@rohanpaul_ai

2026-05-26 22:40·37天前

AI 摘要

Meta、CMU等机构在论文中提出了Self-play SWE-RL方法。该方法让编程智能体通过“自我博弈”生成训练数据，而非仅依赖人工标注的问题。具体而言，一个模型探索代码库、注入bug并留下测试用例来描述问题；另一个模型则学习根据测试修复系统。其中，测试成为了描述问题的核心语言。该方法在SWE-bench Verified上提升了+10.4分，在SWE-Bench Pro上提升了+7.8分。值得注意的是，评估使用了该系统未训练过的自然语言问题，表明其可能学到了更深层的软件理解能力。

Brilliant new paper from Meta， CMU and other labs.

Shows that coding agents improve faster by manufacturing their own software experience.

Coding agents can train themselves by making and fixing bugs inside real projects.

Most coding agents still learn from human leftovers： issues， pull requests， tests， comments， and benchmarks that describe what went wrong.

That is useful， but it makes the agent dependent on the rate at which humans produce clean， verifiable lessons.

Self-play SWE-RL changes the unit of learning from a labeled task to an executable situation.

One version of the model explores a real codebase， weakens tests， injects a meaningful bug， and leaves behind test artifacts that define the failure without needing an English issue description.

Another version of the same model has to repair the system， not by matching words to patches， but by restoring behavior under tests.

Here's the key point： the test is not just a grader here， it is the language of the problem.

That matters because software understanding lives in constraints， dependencies， edge cases， and invariants that prose often compresses or misses.

The reported gains， +10.4 points on SWE-bench Verified and +7.8 on SWE-Bench Pro， are early but hard to ignore because evaluation still used natural-language issues the self-play system did not train on.

That suggests SSR （Self-play SWE-RL） is learning something deeper than issue phrasing， though not yet anything like open-ended mastery.

The restraint matters： generated bugs can be artificial， rewards can be noisy， and sandboxed repositories are still a narrow slice of software reality.

Still， the direction is sharp.

The next bottleneck for coding agents may not be more human-written tasks， but more ways for agents to encounter， create， survive， and learn from failure.

Rohan Paul@rohanpaul_ai · X

61导出 Markdown