RUBRIC-ARROW: Pointwise Rubric Reward Modeling for LLM Post-Training in Non-Verifiable Domains

2026-05-27 08:00 · 37 days ago

AI Summary

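RUBRIC-ARROW is an alternating reward-modeling framework that addresses the scoring ties that rubric-based reward models run into when post-training large language models in subjective, non-verifiable domains. It jointly trains a rubric generator and a rubric-conditioned judge, and its reinforcement learning stage uses only pairwise preference data. The core method adopts a probability-based scoring rule to reduce ties and combines it with an alternating GRPO scheme that trains the pointwise evaluator using phase-specific preference rewards. Experiments show that the framework achieves competitive reward-modeling accuracy and yields consistent gains for downstream policy post-training.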

Original text · not translated

Pointwise reward modeling offers critical signals for LLM post-training, yet struggles with absolute scoring in subjective, non-verifiable settings. Rubric-based methods address this by decomposing evaluation into explicit criteria, but existing approaches typically depend on frontier LLMs and suffer from ties caused by hard Boolean aggregation. We present RUBRIC-ARROW, an alternating framework that jointly trains a rubric generator and a rubric-conditioned judge, with its RL stage using only pairwise preference data. Our method couples a probability-based scoring rule that reduces ties with phase-specific preference-based rewards and an alternating GRPO scheme that together train the pointwise evaluator. Extensive experiments show that RUBRIC-ARROW achieves competitive reward-modeling accuracy and yields consistent gains for downstream policy post-training.
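To make the tie problem concrete, here is a minimal sketch of why hard Boolean aggregation can leave two responses indistinguishable while a probability-based score separates them. The abstract does not specify the scoring rule, so everything here is an assumption for illustration: the per-criterion probabilities, the 0.5 threshold, the mean as the soft aggregate, and the function names `boolean_score`, `probability_score`, and `prefer` are all hypothetical and are not taken from the paper.

```python
from typing import Sequence


def boolean_score(probs: Sequence[float], threshold: float = 0.5) -> int:
    """Hard aggregation (assumed form): count criteria judged as satisfied."""
    return sum(1 for p in probs if p >= threshold)


def probability_score(probs: Sequence[float]) -> float:
    """Soft aggregation (assumed form): mean probability that each criterion is met."""
    return sum(probs) / len(probs)


def prefer(score_a: float, score_b: float) -> str:
    """Turn two pointwise scores into a pairwise verdict."""
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "tie"


# Hypothetical judge outputs: probability that each of three rubric
# criteria is satisfied, for two candidate responses.
response_a = [0.9, 0.8, 0.6]
response_b = [0.7, 0.6, 0.55]

# Every criterion clears the threshold for both responses, so the
# Boolean counts are equal and the pair cannot be ranked.
print(prefer(boolean_score(response_a), boolean_score(response_b)))

# The soft scores keep the judge's confidence, so the pair is ranked.
print(prefer(probability_score(response_a), probability_score(response_b)))
```

Under these assumed numbers, the Boolean rule scores both responses 3 out of 3, while the soft rule retains the judge's confidence and can rank them. A pairwise preference label can then be compared against the verdict from `prefer`, which is one plausible way pointwise scores could be supervised with pairwise data only.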

HuggingFace Daily Papers (trending community papers)


Read the original paper · arxiv.org
