Minimal RLVR Training Is Enough: Extrapolating Large Language Models via Rank-1 Trajectories

2026-05-20 08:00 · 44 days ago

AI Summary

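The study finds that when large language models are trained with reinforcement learning with verifiable rewards (RLVR), the weight trajectories are extremely low-rank and highly predictable: most of the performance gain is captured by a rank-1 approximation of the parameter deltas, and the magnitude of that projection evolves near-linearly with training steps. Building on this, the authors propose RELEX, which estimates the rank-1 subspace from a short observation window and predicts later checkpoints by linear extrapolation, with no learned model required. Across three models, RELEX matches or exceeds RLVR performance on in-domain and out-of-domain benchmarks using as few as 15% of the full training steps, and it can extrapolate to 10-20 times the observation window at no additional training cost, with performance continuing to improve. The authors attribute this success to a "denoising" effect: projecting onto the rank-1 subspace discards stochastic optimization noise.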

Original text · untranslated

Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving reasoning in large language models (LLMs), yet the underlying geometry of the resulting parameter trajectories remains underexplored. In this work, we demonstrate that RLVR weight trajectories are extremely low-rank and highly predictable. Specifically, we find that the majority of downstream performance gains are captured by a rank-1 approximation of the parameter deltas, where the magnitude of this projection evolves near-linearly with training steps. Motivated by this, we propose a simple and compute-efficient method, RELEX (REinforcement Learning EXtrapolation), which estimates the rank-1 subspace from a short observation window and extrapolates future checkpoints via linear regression, with no learned model required. Across three models (i.e., Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base), RELEX produces checkpoints that match or exceed RLVR performance on both in-domain and out-of-domain benchmarks, requiring as few as 15% of the steps of full RLVR training. Remarkably, RELEX is able to extrapolate far beyond the observation window at no training cost, predicting checkpoints up to 10-20× beyond the observed prefix with continued improvement (e.g., observe only the first 50 steps and extrapolate to 1000 steps). Our ablation analysis confirms the minimalist sufficiency of RELEX: neither increasing the subspace rank nor employing non-linear modeling yields further gains in extrapolation. Finally, we show that RELEX's success stems from a "denoising" effect: by projecting updates onto the rank-1 subspace, the model discards stochastic optimization noise that would otherwise degrade performance during extrapolation. Our code is available at https://github.com/weizhepei/RELEX.
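
The recipe described in the abstract (estimate a rank-1 subspace from a short window, then extrapolate its coefficient linearly) can be illustrated with a small NumPy sketch. This is not the authors' implementation; see the linked repository for that. It assumes one plausible reading of the abstract: each checkpoint is flattened into a vector, the deltas from the base weights are stacked into a matrix, the top right singular vector of that matrix serves as the rank-1 direction, and a straight line is fit to the projection coefficients. The function names `fit_rank1_trajectory` and `extrapolate_checkpoint`, and the synthetic data, are hypothetical.

```python
import numpy as np


def fit_rank1_trajectory(base, checkpoints, steps):
    """Fit a rank-1 direction and a linear coefficient model.

    base:        (D,) flattened base-model weights (assumed layout)
    checkpoints: (T, D) flattened weights at the observed steps
    steps:       (T,) training step of each observed checkpoint
    """
    deltas = checkpoints - base  # (T, D) parameter deltas
    # The top right singular vector spans the best rank-1 subspace of the deltas.
    _, _, vt = np.linalg.svd(deltas, full_matrices=False)
    direction = vt[0]  # (D,), unit norm
    # Projection magnitude of each observed delta onto the direction.
    coeffs = deltas @ direction  # (T,)
    # Ordinary least-squares line: coeff ~ slope * step + intercept.
    slope, intercept = np.polyfit(steps, coeffs, deg=1)
    return direction, slope, intercept


def extrapolate_checkpoint(base, direction, slope, intercept, target_step):
    """Predict flattened weights at target_step along the rank-1 direction."""
    return base + (slope * target_step + intercept) * direction


# Synthetic demo: a noisy trajectory that drifts along one hidden direction.
rng = np.random.default_rng(0)
dim = 200
base = rng.normal(size=dim)
hidden = rng.normal(size=dim)
hidden /= np.linalg.norm(hidden)
steps = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # short observation window
noise = 1e-3 * rng.normal(size=(len(steps), dim))  # stand-in for optimizer noise
observed = base + 0.01 * steps[:, None] * hidden + noise

direction, slope, intercept = fit_rank1_trajectory(base, observed, steps)
predicted = extrapolate_checkpoint(base, direction, slope, intercept, 1000.0)
```

The sign of the singular vector is arbitrary, but the fitted coefficients flip with it, so the predicted checkpoint is unaffected. In this sketch, components of each observed delta that lie outside the fitted direction are simply dropped, which is the sense in which a rank-1 projection can act as a denoiser.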

HuggingFace Daily Papers (trending community papers)
