# SDR：基于集合距离的胸部X光报告生成奖励方法

- 来源：HuggingFace Daily Papers（社区热门论文）
- 发布时间：2026-05-30 08:00
- AIHOT 分数：44
- AIHOT 链接：https://aihot.virxact.com/items/cmq6xxxkk00mdslbhvrmhmbwj
- 原文链接：https://arxiv.org/abs/2606.00440

## AI 摘要

针对标准精确匹配奖励不适用胸部X光报告生成的问题，提出SDR方法。将报告分割为句子，用冻结的句子Transformer嵌入为无序集合，以生成与参考嵌入间的集合到集合距离作为连续、置换不变的奖励。在Qwen3-VL-2B/4B和Gemma3-4B上通过GRPO后训练，BERTScore、RadGraph F1和CheXbert F1分别相对提升6.80%、7.82%和4.45%。同一距离用于测试时best-of-N选择，在Mistral-Small、Gemini-2.5 Flash-Lite和GPT-4o-mini上BERTScore平均相对提升16.4%。作为流式信号，可在生成中修剪低分候选，减少超过50%的生成token且保持质量。代码已公开。

## 正文

Reinforcement learning with verifiable rewards has rapidly advanced reasoning in vision--language models. However, for chest X-ray report generation, the standard rewards (i.e. exact-match accuracy and step-level processes) are incompatible because the reports consist of unordered and orthogonal findings, rather than a causal reasoning chain. We address this gap with a set-based view: each report is split into sentences and embedded by a frozen sentence transformer, yielding unordered embedding sets. We propose the use of set-to-set distances between generated and reference embeddings as continuous, permutation-invariant rewards. Across two datasets and three vision--language models (Qwen3-VL-2B/4B, Gemma3-4B), post-training with set-to-set distance based rewards via GRPO consistently outperforms supervised fine-tuning and exact-match GRPO on all headline metrics (BERTScore, RadGraph F1 and CheXbert F1 by average \%6.80, \%7.82 and \%4.45 relative improvements respectively). The same set distances also enable test-time best-of-N selection: scoring candidates by their distance to training-report embeddings outperforms random selection on our trained models as well as three closed-source LLMs (Mistral-Small, Gemini-2.5 Flash-Lite, GPT-4o-mini) with on average \%16.4 relative improvement on BERTScore. Used as a streaming signal, they support a more efficient form of test-time scaling: pruning low-scoring candidates mid-generation reduces generated tokens by over 50\% while preserving the Findings quality of full best-of-N selection. Together these results establish set-distance rewards as a unified signal for both post-training and test-time scaling in chest X-ray report generation. Our code is publicly https://anonymous.4open.science/r/Set-Distance-Rewards-CXR-BFDA{available}.