# NormGuard: Reward-Preserving Norm Constraints for Flow-Matching Reinforcement Learning

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-26 08:00
- AIHOT score: 43
- AIHOT link: https://aihot.virxact.com/items/cmqynkmvl0058slo1qkhavufa
- Original link: https://arxiv.org/abs/2606.27771

## AI Summary

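During reinforcement learning post-training of flow-matching generative models, the velocity norm inflates by 5% to 15%, which degrades perceptual quality, and inference-time rescaling cannot repair it. NormGuard introduces a hinge penalty that activates only when the velocity norm exceeds the reference value and composes additively with any velocity-local loss. Across two base models, three post-training methods (NFT, AWM, DPO), and two reward proxies, NormGuard consistently improves MLLM-judged image quality and realism while preserving reward. The gains grow further under few-step inference and are not explained by early stopping.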

## Main Text

Reinforcement learning (RL) post-training improves the reward alignment of flow-based generators, but often degrades perceptual quality in ways that the reward proxy does not capture. We identify a simple structural signature of this drift: across three post-training methods (NFT, AWM, DPO), RL fine-tuning inflates the per-step velocity norm |v_θ| by 5% to 15% relative to the reference. A form of norm inflation has been studied in classifier-free guidance (CFG), where rescaling the velocity back to a reference norm at inference time can mitigate the resulting artifacts. However, this inference-time correction does not transfer cleanly to RL: rescaling v_θ to match |v_ref| at inference time neither improves reward nor fixes the quality degradation, because the inflation is co-adapted into the model weights. Furthermore, an adjoint sensitivity analysis shows that velocity magnitude rescaling carries no coherent first-order reward signal at the batch level, indicating that suppressing norm inflation is unlikely to remove a consistently reward-carrying component. Since inference-time renormalization fails while norm suppression carries no reward cost, training-time intervention is the appropriate strategy. Together, these findings motivate NormGuard, a hinge penalty that activates only when |v_θ| exceeds |v_ref| and composes additively with any velocity-local base loss. Across two base models, three post-training methods, and two reward proxies, NormGuard consistently improves MLLM-judged image quality and forensic realism while preserving reward, with gains that amplify under few-step inference and are not explained by early stopping.
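The hinge penalty described above can be illustrated with a minimal pure-Python sketch. The abstract does not give the exact functional form, so several details here are assumptions: the hinge is unsquared and taken per sample on L2 norms, it is averaged over the batch, and the function names (`hinge_norm_penalty`, `total_loss`) and the weight `lam` are hypothetical. A real training loop would compute the same quantity inside an autodiff framework so that gradients flow to the model parameters.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def hinge_norm_penalty(v_theta_batch, v_ref_batch):
    """Mean over the batch of max(0, ||v_theta|| - ||v_ref||).

    Assumption: an unsquared, per-sample hinge on L2 norms; the source
    only states that the penalty activates when |v_theta| exceeds |v_ref|.
    """
    penalties = [
        max(0.0, l2_norm(v) - l2_norm(r))
        for v, r in zip(v_theta_batch, v_ref_batch)
    ]
    return sum(penalties) / len(penalties)


def total_loss(base_loss, v_theta_batch, v_ref_batch, lam=0.1):
    """Additive composition: base loss plus a weighted hinge penalty.

    `lam` is a hypothetical weighting coefficient, not a value from the source.
    """
    return base_loss + lam * hinge_norm_penalty(v_theta_batch, v_ref_batch)


# Toy velocities: two samples, two dimensions each.
v_ref = [[3.0, 4.0], [0.0, 1.0]]        # reference norms 5.0 and 1.0
v_inflated = [[3.3, 4.4], [0.0, 1.0]]   # first sample's norm inflated by 10%
v_shrunk = [[1.5, 2.0], [0.0, 0.5]]     # both norms below the reference

print(hinge_norm_penalty(v_inflated, v_ref))  # positive: one sample exceeds its reference
print(hinge_norm_penalty(v_shrunk, v_ref))    # zero: the hinge is inactive below the reference
```

In this toy case only the inflated sample contributes to the penalty, and velocities at or below the reference norm incur no cost. That one-sided behavior is what distinguishes a hinge from a symmetric norm-matching term.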
