# Same Question, Different Source, Different Answer: Auditing Source-Dependence in Medical Multi-Source RAG Systems

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-27 08:00
- AIHOT score: 50
- AIHOT link: https://aihot.virxact.com/items/cmpwt12lv00frsl79q50yn4ao
- Original link: https://arxiv.org/abs/2605.29084

## AI Summary

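When a retrieval-augmented generation system works over a multi-source corpus, it may give different answers to the same question depending on which source it retrieves, a failure mode that existing evaluation schemes cannot diagnose. In the setting of medical patient education, the research team releases three tools: the benchmark TransplantQA, which provides reference answers grounded in multi-institution handbooks for real patient questions; HERO-QA, a hierarchical retrieval and auditing strategy; and a structured evaluator, built on a validated 5-label taxonomy, that scores inter-source relationships. A large-scale audit shows that better retrieval exposes far more source disagreement than previously estimated. The framework is domain-agnostic.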

## Main Text

A retrieval-augmented generation (RAG) system deployed over a multi-author institutional corpus can give a different answer to the same question depending on which source it retrieves, a failure mode the dominant single-gold-answer paradigm cannot diagnose. We argue that source-dependence is a missing axis of NLP evaluation, and that auditing it means shifting the unit of evaluation from answer correctness to the inter-source relationship. We make this concrete in transplant patient education, where institutional sources demonstrably disagree, releasing three artefacts: TransplantQA, a benchmark of real patient questions, each answered by grounding generation in multiple institutional handbooks as candidate sources; HERO-QA, a hierarchical retrieval strategy that grounds and audits each answer; and a structured-output judge that scores inter-source relationships on a validated 5-label taxonomy. At scale, better retrieval reveals far more disagreement than prior estimates suggested: those estimates understated its prevalence, not its intensity. The framework is domain-agnostic and transfers to legal and educational RAG: measuring source-dependence is a responsibility for deployed multi-source NLP generally.
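To make the shift from answer correctness to inter-source relationships concrete, here is a minimal Python sketch of a pairwise audit. It is not the paper's HERO-QA pipeline or its judge. The function names (`audit_question`, `disagreement_prevalence`, `toy_judge`), the source names, the answer strings, and the two labels `"agree"` and `"differ"` are all hypothetical placeholders; the abstract does not list the five labels of the actual taxonomy, and a real judge would presumably be a model producing structured output rather than a string comparison.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

# A judge maps two answers to a relationship label. The label names used here
# ("agree", "differ") are placeholders, not the paper's 5-label taxonomy.
Judge = Callable[[str, str], str]
PairLabels = Dict[Tuple[str, str], str]


def audit_question(answers_by_source: Dict[str, str], judge: Judge) -> PairLabels:
    """Label every unordered pair of per-source answers to one question."""
    sources = sorted(answers_by_source)
    return {
        (a, b): judge(answers_by_source[a], answers_by_source[b])
        for a, b in combinations(sources, 2)
    }


def disagreement_prevalence(audits: List[PairLabels], agree_label: str = "agree") -> float:
    """Fraction of questions with at least one source pair not labelled as agreeing."""
    if not audits:
        return 0.0
    flagged = sum(
        any(label != agree_label for label in audit.values()) for audit in audits
    )
    return flagged / len(audits)


def toy_judge(x: str, y: str) -> str:
    """Stand-in judge: case-insensitive exact match (assumption for illustration)."""
    return "agree" if x.strip().lower() == y.strip().lower() else "differ"


# Made-up answers generated from three hypothetical handbooks for two questions.
q1 = {"handbook_a": "Answer one.", "handbook_b": "answer one.", "handbook_c": "Answer two."}
q2 = {"handbook_a": "Same answer.", "handbook_b": "Same answer."}

audits = [audit_question(q1, toy_judge), audit_question(q2, toy_judge)]
print(audits[0])
print(disagreement_prevalence(audits))  # 0.5: one of the two questions has a differing pair
```

The unit being scored is the pair of sources, not a single answer against a gold reference, so no gold answer appears anywhere in the sketch. Counting questions with any non-agreeing pair is one simple way to express prevalence; the paper may define it differently.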