AI models often give the right answers but point to the wrong sources

2026-05-25 15:30 · Jonathan Kemper

AI summary

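Researchers at Peking University found that mainstream large language models such as GPT and Gemini often cite passages that do not support their answers when analyzing documents. Even when the answer itself is correct, the cited evidence is frequently wrong. The researchers call this phenomenon "attribution hallucination" and note that it poses a risk in regulated fields such as law and medicine. To address it, they propose CiteVQA, a new benchmark and the first to test the problem systematically.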

Just because a language model nails a question about a PDF doesn't mean it actually found the answer where it claims to.

Researchers at Peking University and the Shanghai Artificial Intelligence Laboratory built a new benchmark called CiteVQA to expose this gap between getting the right answer and pointing to the right source. They call it "attribution hallucination."

Standard document analysis tests like DocVQA or MMLongBench-Doc only grade the final answer. They can't tell whether a model actually pulled information from the document or just guessed based on what it already knew. In law, financial audits, or medicine, though, traceability is what makes an AI output usable in the first place, the paper argues.

Pinpointing evidence

CiteVQA makes models back up every statement with a precise marker in the document. They have to point to the exact paragraph, table, or figure. A page number alone won't do. The dataset covers 1,897 questions across 711 PDFs from seven subject areas; 451 of the documents are in English and 260 are in Chinese. The documents average 40.6 pages each, far longer than those in most benchmarks.

Rather than hand-labeling everything, the team built an automated pipeline. It breaks documents into individual elements, has models like Gemini 3.0 Flash trace the chain of evidence, and then checks which pieces are truly needed. Each piece of evidence gets pulled out on a trial basis. If the model can't answer the question without it, that piece counts as essential.
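The trial-removal check can be pictured as a leave-one-out test over the candidate evidence. The following is a minimal sketch under stated assumptions, not the paper's pipeline: `essential_elements`, the `can_answer` callback, and the toy keyword oracle are hypothetical names introduced for illustration. In a real setup, a model call would take the place of `can_answer`.

```python
from typing import Callable, Sequence


def essential_elements(
    elements: Sequence[str],
    can_answer: Callable[[Sequence[str]], bool],
) -> list[str]:
    """Return the elements whose removal makes the question unanswerable."""
    essential = []
    for i, element in enumerate(elements):
        # Pull one element out on a trial basis and re-ask the question.
        remaining = [e for j, e in enumerate(elements) if j != i]
        if not can_answer(remaining):
            essential.append(element)
    return essential


# Toy stand-in for a model (an assumption for this sketch): the question is
# answerable only if both the revenue table and the currency footnote remain.
def toy_can_answer(context: Sequence[str]) -> bool:
    return (
        "table: revenue 2023" in context
        and "footnote: figures in EUR" in context
    )


elements = [
    "paragraph: company history",
    "table: revenue 2023",
    "footnote: figures in EUR",
]
needed = essential_elements(elements, toy_can_answer)
print(needed)
```

In this toy case the history paragraph is dropped from the evidence set, because the question can still be answered without it.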

The core metric is called Strict Attributed Accuracy. A model only gets points when the answer is correct and the citation lands on the right spot. Twenty current models were put through the test.
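To make the difference between plain answer accuracy and the strict metric concrete, here is a small sketch. The article does not spell out how a citation is matched to the right spot, so this example assumes a citation counts only when the set of cited element IDs equals the gold set exactly. The `Prediction` class, the ID strings, and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    answer_correct: bool    # answer judged against the reference answer
    cited: frozenset[str]   # element IDs the model cited (made-up ID format)
    gold: frozenset[str]    # element IDs that actually support the answer


def answer_accuracy(preds: list[Prediction]) -> float:
    """Share of correct answers, ignoring citations, on a 0-100 scale."""
    if not preds:
        raise ValueError("no predictions to score")
    return 100 * sum(p.answer_correct for p in preds) / len(preds)


def strict_attributed_accuracy(preds: list[Prediction]) -> float:
    """Credit only when the answer is correct AND the citation matches.

    Assumption: exact set equality stands in for the benchmark's real
    citation-matching rule.
    """
    if not preds:
        raise ValueError("no predictions to score")
    hits = sum(p.answer_correct and p.cited == p.gold for p in preds)
    return 100 * hits / len(preds)


preds = [
    Prediction(True, frozenset({"p3-table1"}), frozenset({"p3-table1"})),
    Prediction(True, frozenset({"p9-para2"}), frozenset({"p4-fig1"})),   # right answer, wrong source
    Prediction(True, frozenset({"p7-para5"}), frozenset({"p7-para5"})),
    Prediction(False, frozenset({"p2-para1"}), frozenset({"p2-para1"})),  # right source, wrong answer
]

print(answer_accuracy(preds))             # 75.0
print(strict_attributed_accuracy(preds))  # 50.0
```

The second prediction is the case the benchmark targets: it raises answer accuracy but earns nothing under the strict metric.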

The best performer, Gemini-3.1-Pro-Preview, scored just 76 out of 100. GPT-5.4 often knew the right answer but couldn't show its work: 87.1 for raw answer quality, just 59 once correct citations were required.

Open-source models fared much worse. Qwen3-VL-235B-A22B, the strongest freely available system, managed 22.5 points. Smaller open models mostly landed below 10, making them "extremely risky" for regulated industries, the researchers say.

Source: The Decoder: AI News (RSS), the-decoder.com