RUBRIC-ARROW: Pointwise Rubric Reward Modeling for LLM Post-Training in Non-Verifiable Domains
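RUBRIC-ARROW is an alternating reward-modeling framework that addresses the scoring ties rubric-based reward models run into when post-training large language models in subjective, non-verifiable domains. It jointly trains a rubric generator and a rubric-conditioned judge, and its reinforcement learning stage uses only pairwise preference data. The core method adopts a probability-based scoring rule to reduce ties and combines it with an alternating GRPO scheme that trains the pointwise evaluator using phase-specific preference rewards. Experiments show that the framework is competitive in reward-modeling accuracy and delivers consistent gains for downstream policy post-training.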
Pointwise reward modeling offers critical signals for LLM post-training, yet struggles with absolute scoring in subjective, non-verifiable settings. Rubric-based methods address this by decomposing evaluation into explicit criteria, but existing approaches typically depend on frontier LLMs and suffer from ties caused by hard Boolean aggregation. We present RUBRIC-ARROW, an alternating framework that jointly trains a rubric generator and a rubric-conditioned judge, with its RL stage using only pairwise preference data. Our method couples a probability-based scoring rule that reduces ties with phase-specific preference-based rewards and an alternating GRPO scheme that together train the pointwise evaluator. Extensive experiments show that RUBRIC-ARROW achieves competitive reward-modeling accuracy and yields consistent gains for downstream policy post-training.
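To make the tie problem concrete, here is a minimal sketch contrasting hard Boolean aggregation with a probability-based score. The abstract does not state the exact scoring rule, so everything here is an assumption for illustration: the function names, the 0.5 threshold, the plain averaging, and the example probabilities are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def boolean_score(probs: Sequence[float], threshold: float = 0.5) -> float:
    """Hard aggregation (assumed form): a criterion counts as met
    when the judge's probability clears the threshold."""
    return sum(1.0 for p in probs if p >= threshold) / len(probs)


def probability_score(probs: Sequence[float]) -> float:
    """Soft aggregation (assumed form): average the judge's
    per-criterion probabilities without thresholding."""
    return sum(probs) / len(probs)


def preference_reward(score_chosen: float, score_rejected: float) -> float:
    """Pairwise check: 1.0 only if the preferred response scores
    strictly higher, so a tie earns no reward."""
    return 1.0 if score_chosen > score_rejected else 0.0


# Hypothetical judge probabilities that each of three rubric criteria is met.
chosen = [0.9, 0.8, 0.4]
rejected = [0.6, 0.55, 0.3]

hard_reward = preference_reward(boolean_score(chosen), boolean_score(rejected))
soft_reward = preference_reward(probability_score(chosen), probability_score(rejected))
print(hard_reward, soft_reward)
```

In this toy case both responses meet two of three criteria after thresholding, so the Boolean scores tie and the pair gives no preference signal, while the averaged probabilities still separate the two responses.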