# Z-Reward: Beyond Scalar Rewards by Internalizing Score Distributions through Reasoning

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-08 08:00
- AIHOT score: 64
- AIHOT link: https://aihot.virxact.com/items/cmq8ws2lj06igslldwjzn3uhf
- Original link: https://arxiv.org/abs/2606.09076

## AI Summary

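Z-Reward is a teacher-student reward modeling framework for text-to-image post-training. The teacher is a 27B VLM trained with Group-wise Direct Score Optimization (GDSO), which combines policy-gradient rewards with supervision on score distributions; the student is trained with Reasoning-Internalized Score Distillation (RISD), which compresses the teacher's reasoning-conditioned distribution into a 9B VLM so that no explicit reasoning chain is needed at inference time. On an internal evaluation set, the 27B teacher reaches 89.6% human preference accuracy, outperforming SFT, RewardDance, and GRPO; the 9B student reaches 88.6%, outperforming O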

## Main Text

Reward models are central to text-to-image post-training, but visual preference is subjective and better represented as a distribution over rubric scores than as a deterministic scalar. Existing scalar, score-token, and pairwise reward models over-compress uncertainty and fine-grained score differences, while reasoning-based generative rewards provide stronger judgments but are costly to deploy and difficult to use as direct optimization signals. We propose Z-Reward, a teacher-student reward modeling framework that decouples reasoning-heavy judgment from efficient reward deployment. The teacher is a large VLM that uses reasoning to infer rubric-aligned score distributions, and is trained with Group-wise Direct Score Optimization (GDSO), which combines policy-gradient rewards from distribution expectations with direct pointwise and pairwise supervision on score distributions and score gaps. The student is trained with Reasoning-Internalized Score Distillation (RISD), which transfers the teacher's reasoning-conditioned score distribution into a compact VLM without requiring explicit reasoning chains at inference time. On our internally annotated evaluation set, the 27B GDSO teacher reaches 89.6% human preference accuracy, outperforming SFT, RewardDance, and GRPO, while the 9B RISD student reaches 88.6%, outperforming the OPD baseline and closely matching the larger teacher. We further show that Z-Reward can serve as a differentiable reward signal for text-to-image optimization, yielding a 41.3% net human-preference improvement over the SFT baseline.
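To make the idea of a reward as a distribution over rubric scores more concrete, here is a minimal sketch. It is not the paper's implementation: the 1-to-5 rubric, the example logits, and the helper names (`softmax`, `expected_score`, `kl_divergence`) are all assumptions chosen for illustration. It shows how a scalar reward could be read off as the expectation of a score distribution, and how a student distribution could be compared against a teacher distribution with a KL term, which is one common choice for distillation and may differ from what RISD actually uses.

```python
import math

# Hypothetical rubric: integer scores 1..5 (an assumption, not from the paper).
SCORES = [1, 2, 3, 4, 5]


def softmax(logits):
    """Turn raw logits over score tokens into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_score(probs, scores):
    """Scalar reward as the expectation of the score distribution."""
    return sum(p * s for p, s in zip(probs, scores))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q): a generic distillation objective, assumed here for illustration."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Made-up logits standing in for a teacher (with reasoning) and a student (without).
teacher_probs = softmax([0.1, 0.5, 1.5, 2.0, 0.7])
student_probs = softmax([0.0, 0.4, 1.4, 2.1, 0.9])

teacher_reward = expected_score(teacher_probs, SCORES)
student_reward = expected_score(student_probs, SCORES)
distill_loss = kl_divergence(teacher_probs, student_probs)

print(f"teacher expected score: {teacher_reward:.3f}")
print(f"student expected score: {student_reward:.3f}")
print(f"KL(teacher || student): {distill_loss:.5f}")
```

Two images can then be ranked by comparing their expected scores, while the full distribution still carries the uncertainty that a single scalar would discard.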
