# Gradient-Based Stability Analysis of RLVR and WAPO

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-15 08:00
- AIHOT score: 54
- AIHOT link: https://aihot.virxact.com/items/cmqhgiko20473sle17r9l7xdu
- Original link: https://arxiv.org/abs/2606.16154

## AI Summary

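Reinforcement learning with verifiable rewards (RLVR) improves the reasoning ability of language models, but GRPO-style optimization is prone to instability. A token-level analysis of gradient dynamics shows that updates are shaped jointly by the sign of the advantage and the token distribution under the current policy. On this basis, the authors propose Winner Advantage Policy Optimization (WAPO), an online clipped policy-gradient objective that updates only on positive-advantage completions. On mathematical reasoning and multi-hop QA benchmarks, WAPO improves training stability and matches or outperforms baselines across multiple model families. The full code is open source.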

## Abstract

Reinforcement learning with verifiable rewards (RLVR) improves language-model reasoning, but GRPO-style optimization remains prone to collapse. We analyse this instability through token-level gradient dynamics, deriving a taxonomy that predicts how updates affect next-token probabilities and entropy. The taxonomy shows that stability depends jointly on the advantage sign and token distribution under the current policy. Motivated by this finding, we propose Winner Advantage Policy Optimization (WAPO), a simple online clipped policy-gradient objective that updates only on positive-advantage completions. Across mathematical reasoning and multi-hop QA benchmarks, WAPO improves training stability and matches or outperforms baselines across multiple model families. Full code can be found at https://github.com/layer6ai-labs/wapo.
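
The abstract describes WAPO only as an online clipped policy-gradient objective that updates on positive-advantage completions. The following is a minimal sketch of what such an objective could look like, not the authors' implementation. The group-normalised advantage, the PPO-style clip range `clip_eps=0.2`, the token-level averaging, and the function names `group_advantages` and `winner_only_clipped_loss` are all assumptions made for illustration; the linked repository is the reference for the actual method.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalised advantages in the style of GRPO.

    Assumption: the abstract does not state how advantages are computed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def winner_only_clipped_loss(logp_new, logp_old, rewards, clip_eps=0.2):
    """Clipped surrogate loss over positive-advantage completions only.

    logp_new, logp_old: one list of per-token log-probabilities per completion,
        under the current policy and the sampling policy respectively.
    rewards: one scalar verifiable reward per completion in the group.
    clip_eps: hypothetical clip range, borrowed from common PPO settings.
    """
    advantages = group_advantages(rewards)
    total, n_tokens = 0.0, 0
    for new_lp, old_lp, adv in zip(logp_new, logp_old, advantages):
        if adv <= 0:
            continue  # completions without a positive advantage contribute nothing
        for lp_n, lp_o in zip(new_lp, old_lp):
            ratio = math.exp(lp_n - lp_o)  # importance ratio for this token
            clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
            # Standard clipped surrogate: take the more pessimistic of the two terms.
            total += min(ratio * adv, clipped * adv)
            n_tokens += 1
    if n_tokens == 0:
        return 0.0  # no winners in the group, so no update signal
    return -total / n_tokens  # negate because optimisers minimise


# Usage: a group of four completions, two of which pass the verifier.
old = [[-1.0, -2.0], [-0.5], [-1.5, -0.7], [-0.3]]
loss = winner_only_clipped_loss(old, old, rewards=[1.0, 0.0, 0.0, 1.0])
```

In this sketch the log-probabilities of the two failed completions never enter the loss, so changing them leaves the result unchanged. That is the only property the abstract confirms; everything else is a design choice of the sketch.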
