Reinforcement Learning Based on Value Gradient Flow
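Source: arxiv.org. Existing methods for behavior-regularized reinforcement learning are either hard to scale to large generative models or overly conservative. To address this, the researchers propose a new paradigm, Value Gradient Flow (VGF). VGF recasts the problem as an optimal transport problem and solves it with a discrete gradient flow, in which value gradients guide particles drawn from the reference distribution, so that regularization is imposed implicitly. The method needs no explicit policy parameterization and supports adaptive test-time scaling by adjusting the transport budget. Experiments show that VGF achieves state-of-the-art performance on the D4RL and OGBench offline RL benchmarks and on LLM RL tasks, significantly outperforming prior methods.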
We study behavior-regularized reinforcement learning (RL), where regularization toward a reference distribution (the dataset in offline RL or the base model in LLM RL finetuning) is essential to prevent value over-optimization caused by erroneous out-of-distribution extrapolation. Existing methods rely either on reparameterized policy gradients, which are difficult to scale to large generative models, or on rejection sampling, which can be overly conservative when attempting to move beyond the behavior support. In this paper, we propose Value Gradient Flow (VGF), a scalable new paradigm for behavior-regularized RL. VGF casts behavior-regularized RL as an optimal transport problem that maps the reference distribution to the value-induced optimal policy distribution. We solve this transport problem via discrete gradient flow, where value gradients guide particles initialized from the reference distribution. Our analysis shows that VGF imposes regularization implicitly by controlling the transport budget. VGF eliminates explicit policy parameterization while remaining expressive and flexible, which enables adaptive test-time scaling by adjusting the transport budget. Extensive experiments demonstrate that VGF significantly outperforms prior methods, achieving state-of-the-art results on offline RL benchmarks (D4RL, OGBench) and LLM RL tasks. Code and runs can be found at https://ryanxhr.github.io/vgf.
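The particle picture described above can be illustrated with a small toy sketch. This is not the authors' implementation: the abstract does not give the exact update rule, the critic, or how the transport budget is defined. The sketch assumes a hypothetical quadratic critic `q_value` with an analytic gradient `q_grad`, plain Euler gradient-ascent steps, and a budget measured simply as the number of steps (`num_steps`) at a fixed `step_size`. All of these names and choices are illustrative assumptions.

```python
import numpy as np


def q_value(actions, target):
    """Toy critic (assumption): Q(a) = -||a - target||^2, highest at target."""
    return -np.sum((actions - target) ** 2, axis=-1)


def q_grad(actions, target):
    """Analytic gradient of the toy critic with respect to the actions."""
    return -2.0 * (actions - target)


def value_gradient_flow(particles, target, num_steps, step_size):
    """Move particles along the value gradient with num_steps Euler steps.

    num_steps * step_size plays the role of a transport budget here
    (an assumption): a budget of zero returns the reference samples
    unchanged, and a larger budget lets particles drift further from them.
    """
    x = np.array(particles, dtype=float)  # copy so the reference is untouched
    for _ in range(num_steps):
        x = x + step_size * q_grad(x, target)
    return x


# Reference distribution: stands in for dataset actions or base-model samples.
rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=0.5, size=(256, 2))
target = np.array([2.0, 2.0])  # maximizer of the toy critic (assumption)

budgets = (0, 5, 50)
mean_values = []
mean_displacements = []
for budget in budgets:
    moved = value_gradient_flow(reference, target, num_steps=budget, step_size=0.05)
    mean_values.append(float(q_value(moved, target).mean()))
    mean_displacements.append(float(np.linalg.norm(moved - reference, axis=-1).mean()))
    print(f"steps={budget:3d}  mean Q={mean_values[-1]:.4f}  "
          f"mean displacement={mean_displacements[-1]:.4f}")
```

In this toy setting, a small budget keeps the particles close to the reference samples, while a larger one trades closeness for higher value. That is the sense in which the budget acts as an implicit regularizer and could be adjusted at test time without retraining anything.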