Rethinking Divergence Regularization in LLM Reinforcement Learning

2026-06-08 08:00 · 25 days ago

AI Summary

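This paper addresses trust-region control in off-policy RL training of LLMs. The ratio clipping used by PPO and GRPO struggles to capture distribution shift over long-tailed vocabularies. DPPO switches to a divergence-based boundary but relies on a hard mask that discards gradients beyond the boundary. The paper proposes DRPO, which replaces the hard mask with a smooth advantage-weighted quadratic regularizer. DRPO preserves DPPO's trust-region geometry and produces bounded, continuous gradient weights that attenuate harmful diverging updates and provide corrective signals beyond the boundary. Experiments show that DRPO improves the stability and efficiency of LLM RL training.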

Original abstract · untranslated

Reinforcement learning (RL) has become a key component of post-training large language models (LLMs). In practice, LLM RL is often off-policy because of training-inference mismatch and policy staleness, making trust-region control essential for stable optimization. Mainstream methods such as PPO and GRPO approximate this control with a ratio-clipping mechanism, but the importance ratio can be a poor proxy for distributional shift in long-tailed vocabularies. Recent work such as DPPO addresses this mismatch by replacing ratio-based clipping with a divergence-based mask, yielding a trust region defined by the sampled token's absolute probability shift. However, DPPO still relies on a hard mask: once a token crosses the trust-region boundary in a harmful direction, its gradient is discarded rather than corrected. To address this, we propose Divergence Regularized Policy Optimization (DRPO), which replaces the hard mask with a smooth advantage-weighted quadratic regularizer on policy shift. DRPO preserves the same trust-region geometry as DPPO while inducing bounded, continuous gradient weights that attenuate diverging updates and provide corrective signals beyond the boundary. Experiments across model scales, architectures, and precision settings show that DRPO improves the stability and efficiency of LLM RL training.
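The contrast the abstract draws between the three mechanisms can be made concrete with a small sketch of per-token gradient weights. The abstract gives no formulas, so everything below is an assumption made for illustration: the function names, the thresholds `eps` and `delta`, and in particular the quadratic penalty form `|A| * (p_new - p_old)^2 / (2 * delta * p_old)`, which was chosen only because its gradient weight reaches zero exactly at an absolute probability shift of `delta`. It should not be read as the paper's actual objective.

```python
# Illustrative sketch only; these forms are assumptions, not taken from the paper.
# Each function returns the multiplier applied to the usual policy-gradient term
# (advantage / p_old) * d p_new / d theta for one sampled token.


def ratio_clip_weight(p_new, p_old, advantage, eps=0.2):
    """PPO/GRPO-style clipping: gradient is zeroed once the ratio leaves
    [1 - eps, 1 + eps] in the direction the advantage is pushing."""
    ratio = p_new / p_old
    if advantage > 0 and ratio > 1 + eps:
        return 0.0
    if advantage < 0 and ratio < 1 - eps:
        return 0.0
    return 1.0


def signed_shift(p_new, p_old, advantage):
    """Absolute probability shift, counted positive when it moves in the
    direction the advantage is pushing (the 'diverging' direction)."""
    shift = p_new - p_old
    return shift if advantage >= 0 else -shift


def hard_mask_weight(p_new, p_old, advantage, delta=0.1):
    """Divergence-style hard mask (assumed form): drop the gradient once the
    signed absolute probability shift exceeds delta."""
    return 0.0 if signed_shift(p_new, p_old, advantage) > delta else 1.0


def quadratic_weight(p_new, p_old, advantage, delta=0.1):
    """Smooth alternative (assumed form): differentiating
    A * p_new / p_old - |A| * (p_new - p_old)**2 / (2 * delta * p_old)
    gives a weight that decays linearly, hits 0 at the same boundary as the
    hard mask, and turns negative (a corrective pull) beyond it."""
    return 1.0 - signed_shift(p_new, p_old, advantage) / delta


if __name__ == "__main__":
    cases = [
        ("rare token, small absolute shift", 0.005, 0.001, 1.0),
        ("common token, past the boundary", 0.7, 0.5, 1.0),
    ]
    for name, p_new, p_old, adv in cases:
        print(
            name,
            ratio_clip_weight(p_new, p_old, adv),
            hard_mask_weight(p_new, p_old, adv),
            round(quadratic_weight(p_new, p_old, adv), 3),
        )
```

Under these assumptions, a rare token that moves from probability 0.001 to 0.005 is clipped by the ratio rule (its ratio is 5) even though its absolute shift is tiny, while both shift-based rules keep its gradient. A common token that moves from 0.5 to 0.7 is dropped by the hard mask but receives a negative, corrective weight from the quadratic form. Because probabilities lie in [0, 1], the quadratic weight stays within [1 - 1/delta, 1 + 1/delta], which is one way to read the abstract's "bounded, continuous gradient weights".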

HuggingFace Daily Papers (community trending papers)


Read the original · arxiv.org
