Learning to Construct Environments: Self-Evolving Reasoning Reinforcement Learning via Verifiable Environment Synthesis
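Source: arxiv.org
The study proposes a new paradigm for language-model self-improvement in which the model shifts from passively generating data to actively constructing the executable environments that train it. The core requirement is that the environments exhibit a stable "solve-verify asymmetry": the model can write the verifier code, yet cannot reliably solve fresh instances in natural language. This asymmetry keeps the reward signal valid. The researchers instantiate the idea as EvoEnv, which synthesizes Python environments and uses them for training only after strict multi-stage validation. Tests on the already-strong Qwen3-4B-Thinking model show that conventional methods lower performance, whereas EvoEnv raises its average from 72.4% to 74.8%. This indicates that the key to stable self-improvement is having the model learn to construct environments that stay structurally beyond its current ability.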
We pursue a vision for self-improving language models in which the model does not merely generate problems or traces to imitate, but constructs the environments that train it. In zero-data reasoning RL, this reframes self-improvement from a data-generation loop into an environment-construction loop, where each artifact is a reusable executable object that samples instances, computes references, and scores responses. Whether this vision sustains improvement hinges on a single property: the environments must exhibit stable solve-verify asymmetry. That is, the model must be able to write, once, an oracle that it cannot reliably execute in natural language on fresh instances. This asymmetry takes two complementary forms. Some tasks are algorithmically hard to reason through but trivial as code: a dynamic program or graph traversal, compiled once, yields unboundedly many calibrated instances. Others are intrinsically hard to solve but easy to verify, like planted subset-sum or constraint satisfaction. Both create a durable gap between proposing and solving that the policy cannot close by gaming the verifier, and it is this gap that keeps reward informative as the learner improves. We instantiate this view in EvoEnv, a single-policy generator-solver method that synthesizes Python environments from ten seeds and admits them only after staged validation, semantic self-review, solver-relative difficulty calibration, and novelty checks. The strongest evidence comes from the already-strong regime: on Qwen3-4B-Thinking, fixed public-data RLVR and fixed hand-crafted environment RLVR reduce the average, while EvoEnv improves it from 72.4 to 74.8, a relative gain of 3.3%. Stable self-improvement, we suggest, depends not on producing more synthetic data, but on models learning to construct worlds whose difficulty stays structurally beyond their own reach.
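To make the "reusable executable object" concrete, the following is a minimal sketch of what one such environment might look like for the planted subset-sum case. It is an illustration under stated assumptions, not EvoEnv's actual interface: the class name `PlantedSubsetSumEnv`, the `Instance` type, the `sample`/`score` method names, and the response format (whitespace- or comma-separated indices) are all hypothetical. The sketch shows the easy-to-verify side of the asymmetry: the reference is known by construction because the solution is planted, and scoring only needs to sum the chosen numbers.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Instance:
    """One sampled problem: pick indices whose numbers sum to target."""
    numbers: Tuple[int, ...]
    target: int


class PlantedSubsetSumEnv:
    """Hypothetical environment object (not the paper's API).

    It samples instances, knows a reference answer by construction,
    and scores free-form responses with a cheap check.
    """

    def __init__(self, n: int = 12, max_value: int = 10**6) -> None:
        self.n = n
        self.max_value = max_value

    def sample(self, rng: random.Random) -> Tuple[Instance, List[int]]:
        """Draw numbers, plant a random non-empty subset, return its sum as target."""
        numbers = tuple(rng.randint(1, self.max_value) for _ in range(self.n))
        k = rng.randint(1, self.n)
        planted = sorted(rng.sample(range(self.n), k))
        target = sum(numbers[i] for i in planted)
        return Instance(numbers, target), planted

    def score(self, inst: Instance, response: str) -> float:
        """Return 1.0 if the response lists distinct valid indices summing to target."""
        try:
            idx = [int(tok) for tok in response.replace(",", " ").split()]
        except ValueError:
            return 0.0  # unparseable response earns no reward
        if not idx or len(set(idx)) != len(idx):
            return 0.0  # empty answers and repeated indices are rejected
        if any(i < 0 or i >= len(inst.numbers) for i in idx):
            return 0.0  # out-of-range indices are rejected
        return 1.0 if sum(inst.numbers[i] for i in idx) == inst.target else 0.0


# Usage: the planted reference always scores 1.0; malformed text scores 0.0.
rng = random.Random(0)
env = PlantedSubsetSumEnv(n=8, max_value=1000)
inst, planted = env.sample(rng)
reward_ok = env.score(inst, " ".join(map(str, planted)))
reward_bad = env.score(inst, "not a list")
```

Note that `score` accepts any valid subset, not only the planted one, so the reward depends on the answer being correct rather than on matching a particular reference string. Difficulty in this sketch could be adjusted through `n` and `max_value`; how EvoEnv calibrates difficulty relative to the solver is not specified in the abstract.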