# Judge Only Once: Multi-Response Reward Modeling in a Single Forward Pass

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-04-13 08:00
- AIHOT link: https://aihot.virxact.com/items/cmnzj0j7u03onsl0fnek6wpc0
- Original link: https://arxiv.org/abs/2604.10966

## AI Summary

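The research team proposes a discriminative multimodal reward model that scores multiple candidate responses at once in a single forward pass, removing the need for one inference pass per response in conventional methods. The design concatenates the responses with separator tokens to enable direct comparative reasoning, yielding up to an N× speedup and FLOPs reduction. Built on a 4B vision-language architecture, the model achieves state-of-the-art results on six benchmarks, including the newly constructed MR^2Bench-Image (covering 8 models) and MR^2Bench-Video (a video benchmark spanning 19 models, based on 94K crowdsourced judgments). When applied to reinforcement learning with GRPO, it substantially outperforms a single-response reward model baseline in both training stability and open-ended generation quality.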

## Full Text

We present a discriminative multimodal reward model that scores all candidate responses in a single forward pass. Conventional discriminative reward models evaluate each response independently, requiring one forward pass per candidate response. Our approach concatenates multiple responses with separator tokens and applies cross-entropy over their scalar scores, enabling direct comparative reasoning and efficient N-way preference learning. The multi-response design also yields up to an N× wall-clock speedup and FLOPs reduction over conventional single-response scoring. To enable N-way reward evaluation beyond existing pairwise benchmarks, we construct two new benchmarks: (1) MR^2Bench-Image, which contains human-annotated rankings over responses from 8 diverse models; and (2) MR^2Bench-Video, a large-scale video-based reward benchmark derived from 94K crowdsourced pairwise human judgments over video question-answering spanning 19 models, denoised via preference graph ensemble. Both benchmarks provide 4-response evaluation variants sampled from the full rankings. Built on a 4B vision-language backbone with LoRA fine-tuning and a lightweight MLP value head, our model achieves state-of-the-art results on six multimodal reward benchmarks (MR^2Bench-Image, MR^2Bench-Video, and four existing benchmarks), outperforming larger generative and discriminative reward models. We further demonstrate that our reward model, when used in reinforcement learning with GRPO, produces policy models that maintain performance across standard multimodal benchmarks while substantially improving open-ended generation quality. It outperforms a single-response discriminative reward model (RM) baseline by a large margin in both training stability and open-ended generation quality.
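The core idea, packing N candidates into one sequence and training with cross-entropy over their scalar scores, can be illustrated with a minimal sketch. It is not the authors' implementation. The separator string `<sep>`, the exact packing layout, the helper names `build_multi_response_input` and `n_way_cross_entropy`, and the hard-coded scores are all assumptions made for illustration; in the real model the scores would come from the MLP value head.

```python
import math

SEP = "<sep>"  # hypothetical separator token; the source does not name the actual token


def build_multi_response_input(prompt, responses, sep=SEP):
    """Pack the prompt and all N candidates into one sequence.

    Assumed layout: prompt, then each response terminated by a separator,
    so a model could read one scalar score at each closing separator.
    """
    return prompt + sep + "".join(r + sep for r in responses)


def n_way_cross_entropy(scores, best_index):
    """Softmax cross-entropy over N scalar scores, with the preferred response as target."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[best_index]


# One packed sequence stands in for what would otherwise be four separate inputs.
packed = build_multi_response_input(
    "Q: what is shown?", ["a cat", "a dog", "a car", "a tree"]
)

# Made-up scores standing in for value-head outputs, one per response.
scores = [2.0, 0.5, -1.0, 0.0]
loss = n_way_cross_entropy(scores, best_index=0)
```

Because all N scores enter a single softmax, raising the preferred response's score necessarily lowers the relative probability of the others, which is one way to read the abstract's "direct comparative reasoning". Under this layout, scoring N candidates takes one forward pass over the packed sequence instead of N passes that each re-encode the prompt, consistent with the "up to N×" speedup the abstract reports.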