Not Every Rubric Criterion Is Equally Effective: Policy-Aware Rubric Rewards for RLVR
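Source: arxiv.org
To address the limitations of static weights in rubric-based rewards for reinforcement learning with verifiable rewards (RLVR), this work proposes the POW3R framework. The framework dynamically adjusts each criterion's reward weight during training to emphasize the criteria that effectively distinguish the current policy's outputs, while preserving the overall human weight distribution. Experiments cover three base policies and two datasets: POW3R wins 24 of 30 comparisons, improves mean rubric reward and strict completion rate, and reduces the number of training steps needed to reach the same performance by a factor of 2.5 to 4. Through policy-aware weight optimization, POW3R makes the reward signal more informative and improves RLVR training efficiency.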
Reinforcement learning with verifiable rewards has made post-training highly effective when correctness can be checked automatically. However, many important model behaviors require satisfying several qualitative criteria at once. Rubric-based rewards address this setting by grading prompt-specific criteria and aggregating them into a scalar reward. Yet standard static aggregations conflate a criterion's human-assigned importance with its current usefulness as an optimization signal. We show that this assumption breaks down in rubric RL: many important criteria are already saturated or currently unreachable, while criteria that distinguish rollouts are not necessarily those with the largest human weights. We introduce POW3R, a policy-aware rubric reward framework that preserves human weights and category balance as the rubric objective while adapting criterion-level reward weights during training. POW3R uses rollout-level contrast to emphasize criteria that currently separate the policy's outputs, making the GRPO reward more informative without changing the underlying evaluation target. Across three base policies on two datasets spanning multimodal and text-only settings, POW3R wins 24 of 30 base-policy/metric comparisons, improving both mean rubric reward and strict completion (the fraction of prompts whose response satisfies every required rubric criterion) over vanilla GRPO with rubric rewards, and reaches the same plateau in 2.5 to 4 times fewer training steps. Rubric rewards should therefore distinguish what should matter in the final answer from what can teach the current policy.
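The abstract does not give POW3R's actual weighting formula, so the following is only a toy sketch of the general idea of rollout-level contrast, not the authors' method. It assumes, purely for illustration, that a criterion's contrast is the variance of its scores across the rollouts sampled for one prompt, and that each human weight is scaled by that contrast and renormalized to the original total weight. The function names (`static_reward`, `policy_aware_weights`, `strict_completion`) are hypothetical, and the category balance mentioned in the abstract is not modeled here.

```python
from statistics import pvariance


def static_reward(scores, weights):
    """Weighted mean of per-criterion scores under fixed weights."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def policy_aware_weights(rollout_scores, human_weights, eps=1e-8):
    """Illustrative reweighting (an assumption, not the paper's formula).

    rollout_scores: one list of per-criterion scores per rollout.
    Each human weight is scaled by the variance of its criterion across
    the rollout group, then rescaled so the total weight is unchanged.
    """
    n_criteria = len(human_weights)
    contrast = [
        pvariance([float(r[j]) for r in rollout_scores])
        for j in range(n_criteria)
    ]
    raw = [w * c for w, c in zip(human_weights, contrast)]
    total_raw = sum(raw)
    if total_raw <= eps:
        # No criterion separates the rollouts: fall back to human weights.
        return [float(w) for w in human_weights]
    scale = sum(human_weights) / total_raw
    return [r * scale for r in raw]


def strict_completion(graded_prompts):
    """Fraction of prompts whose response passes every required criterion.

    graded_prompts: one list of (passed, required) pairs per prompt.
    """
    done = sum(
        all(passed for passed, required in crits if required)
        for crits in graded_prompts
    )
    return done / len(graded_prompts)


# Toy group of four rollouts graded on three criteria:
# criterion 0 is saturated, criterion 1 is never met, criterion 2 varies.
rollouts = [[1, 0, 1], [1, 0, 0], [1, 0, 1], [1, 0, 0]]
human_weights = [5, 3, 1]
adapted = policy_aware_weights(rollouts, human_weights)
rewards = [static_reward(r, adapted) for r in rollouts]
```

In this toy group the saturated and never-satisfied criteria carry no within-group contrast, so the sketch moves all training-time weight onto the third criterion, even though it has the smallest human weight. The human weights would still define the evaluation target; only the reward used for the policy update is reweighted.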