ClaimDiff-RL: Fine-Grained Reinforcement Learning for Image Captioning via Visual Claim Comparison
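Source: arxiv.org. To address insufficient reward granularity in reinforcement learning for image captioning, this work proposes the ClaimDiff-RL framework. The method decomposes the holistic sequence-level reward into atomic visual claim differences, which serve as the reward unit. Given an image, a generated caption, and a reference caption, a multimodal judge enumerates the verifiable visual claim differences between the two captions, assigns each an error type and a severity level, and builds the reward from them. This allows hallucinations and omitted key facts to be measured and tuned independently. Experiments show that the framework improves the balance between factuality and coverage on multiple benchmarks, and that it even surpasses Gemini-3-Pro-Preview on fine-grained capabilities such as object counting and spatial relations.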
Long-form image captioning exposes a reward granularity problem in RL: captions are judged as whole sequences, while the important errors occur at the level of individual visual claims. A good dense caption should be both faithful and informative, avoiding hallucination without omitting salient details. Yet pairwise preferences, reference-based metrics, and holistic scalar rewards compress these local errors into a single sequence-level signal, obscuring the tradeoff between factuality and coverage. We introduce ClaimDiff-RL, a framework that uses reference-conditioned atomic claim differences as the reward unit for caption RL. Given an image, an actor caption, and a reference caption, a multimodal judge enumerates visually grounded differences, verifies each difference against the image, assigns open-vocabulary error types and severity levels, and produces per-difference statistics for reward composition. This makes hallucinated claims and omitted salient facts separately measurable and tunable. Experiments show that holistic scalar rewards can reduce hallucination at the cost of more missing facts, while ClaimDiff-RL exposes this tradeoff between faithfulness and coverage and enables more balanced operating points. On a 160-image human-labeled diagnostic benchmark, public captioning benchmarks, and VQA benchmarks, ClaimDiff-RL improves the balance between hallucinations and missing facts, preserves general capability, and even surpasses Gemini-3-Pro-Preview on several fine-grained capability dimensions such as object counting, spatial relations, and scene recognition. These results suggest that typed, verifiable claim differences are an effective reward unit for fine-grained and diagnosable caption RL.
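To make the idea of composing a reward from per-difference statistics concrete, here is a minimal Python sketch. It is not the paper's implementation: the abstract does not give the reward formula, so the `ClaimDiff` record, the `compose_reward` function, the integer severity scale, the two weights `w_halluc` and `w_missing`, and the linear penalty are all assumptions chosen for illustration. The only property it is meant to show is that hallucinated and missing claims are tallied separately, so their relative weight can be tuned.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClaimDiff:
    """One judged difference between actor and reference caption (hypothetical schema)."""
    claim: str        # the atomic visual claim in question
    kind: str         # "hallucination" (actor asserts it, image contradicts it)
                      # or "missing" (salient fact the actor omitted)
    error_type: str   # open-vocabulary label, e.g. "object count"
    severity: int     # assumed scale: 1 (minor) to 3 (severe)
    verified: bool    # True if the judge confirmed the difference against the image


def compose_reward(diffs: List[ClaimDiff],
                   w_halluc: float = 1.0,
                   w_missing: float = 1.0) -> float:
    """Assumed linear reward: negative weighted sum of verified severities.

    Differences the judge could not verify against the image are ignored.
    """
    halluc = sum(d.severity for d in diffs
                 if d.verified and d.kind == "hallucination")
    missing = sum(d.severity for d in diffs
                  if d.verified and d.kind == "missing")
    return -(w_halluc * halluc + w_missing * missing)
```

Under these assumptions, lowering `w_missing` relative to `w_halluc` pushes the policy toward shorter, safer captions, while raising it rewards coverage. A single holistic scalar reward offers no comparable handle, which is the tradeoff the abstract describes.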