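In their paper, researchers from Meta, CMU, and other institutions propose Self-play SWE-RL. The method has coding agents generate their own training data through self-play rather than relying only on human-labeled issues. Specifically, one model explores a codebase, injects a bug, and leaves behind test cases that describe the problem; another model learns to repair the system according to those tests. Tests thus become the core language for describing problems. The method gains 10.4 points on SWE-bench Verified and 7.8 points on SWE-Bench Pro. Notably, the evaluation used natural-language issues that the system was never trained on, which suggests it may have learned a deeper understanding of software.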
Brilliant new paper from Meta, CMU and other labs.
It shows that coding agents improve faster by manufacturing their own software experience.
Coding agents can train themselves by making and fixing bugs inside real projects.
Most coding agents still learn from human leftovers: issues, pull requests, tests, comments, and benchmarks that describe what went wrong.
That is useful, but it makes the agent dependent on the rate at which humans produce clean, verifiable lessons.
Self-play SWE-RL changes the unit of learning from a labeled task to an executable situation.
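To make "executable situation" concrete, here is a minimal toy sketch in Python. It is not the paper's implementation: the names `Situation`, `inject_bug`, and `reward` are hypothetical, the "repository" is a single function stored as a string, and the bug injector is a hard-coded operator flip standing in for a learned agent. It only illustrates the shape of the loop, in which tests (not a written issue) define the problem and decide the reward.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Toy "repository": one module stored as source text (hypothetical example).
CLEAN_SOURCE = "def add(a, b):\n    return a + b\n"


@dataclass
class Situation:
    """An executable situation: broken code plus tests that expose the bug."""
    broken_source: str
    tests: List[Callable[[Dict], bool]]


def run_tests(source: str, tests: List[Callable[[Dict], bool]]) -> bool:
    """Return True only if every test passes against the given source."""
    namespace: Dict = {}
    try:
        exec(source, namespace)  # load the candidate code
        return all(test(namespace) for test in tests)
    except Exception:
        return False


def inject_bug(source: str) -> Situation:
    """Stand-in for the bug-injecting role: flip an operator, keep the tests."""
    tests = [
        lambda ns: ns["add"](2, 3) == 5,
        lambda ns: ns["add"](-1, 1) == 0,
    ]
    broken = source.replace("a + b", "a - b")
    # A useful situation: tests pass on the clean code and fail on the broken code.
    assert run_tests(source, tests) and not run_tests(broken, tests)
    return Situation(broken_source=broken, tests=tests)


def reward(situation: Situation, patched_source: str) -> float:
    """Stand-in for the solver's training signal: 1.0 if the patch passes the tests."""
    return 1.0 if run_tests(patched_source, situation.tests) else 0.0


if __name__ == "__main__":
    situation = inject_bug(CLEAN_SOURCE)
    print(reward(situation, situation.broken_source))  # unrepaired code
    print(reward(situation, CLEAN_SOURCE))             # a correct repair
```

In this sketch the solver never sees a natural-language description. The only specification is the set of tests left behind by the injector, which is the design choice the post highlights.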