AI models often give the right answers but point to the wrong sources
Just because a language model nails a question about a PDF doesn't mean it actually found the answer where it claims to.
Researchers at Peking University and the Shanghai Artificial Intelligence Laboratory built a new benchmark called CiteVQA to expose this gap between getting the right answer and pointing to the right source. They call it "attribution hallucination."
Standard document analysis tests like DocVQA or MMLongBench-Doc only grade the final answer. They can't tell whether a model actually pulled information from the document or just guessed based on what it already knew. In law, financial audits, or medicine, though, traceability is what makes an AI output usable in the first place, the paper argues.
Pinpointing evidence
CiteVQA makes models back up every statement with a precise marker in the document. They have to point to the exact paragraph, table, or figure. A page number alone won't do. The dataset covers 1,897 questions across 711 PDFs from seven subject areas: 451 in English and 260 in Chinese. The documents average 40.6 pages each, far longer than the documents in most benchmarks.
Rather than hand-labeling everything, the team built an automated pipeline. It breaks documents into individual elements, has models like Gemini 3.0 Flash trace the chain of evidence, and then checks which pieces are truly needed. Each element gets pulled out on a trial basis. If the model can't answer the question without it, that element counts as essential.
The core metric is called Strict Attributed Accuracy. A model only gets points when the answer is correct and the citation lands on the right spot. Twenty current models were put through the test.
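To see how such a metric differs from plain answer accuracy, here is a minimal sketch. The article does not spell out the paper's exact matching rule, so this version assumes a citation counts as correct only when the cited element IDs exactly equal the essential ones; the `Prediction` type and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Prediction:
    answer_correct: bool       # did the final answer match the reference?
    cited: FrozenSet[str]      # element IDs the model cited (hypothetical scheme)
    required: FrozenSet[str]   # element IDs marked essential for the question


def citation_ok(p: Prediction) -> bool:
    # Assumption: exact match between cited and essential elements.
    return p.cited == p.required


def answer_accuracy(preds: List[Prediction]) -> float:
    """Share of questions answered correctly, ignoring citations (0-100)."""
    return 100 * sum(p.answer_correct for p in preds) / len(preds)


def strict_attributed_accuracy(preds: List[Prediction]) -> float:
    """Share of questions with a correct answer AND a correct citation (0-100)."""
    return 100 * sum(p.answer_correct and citation_ok(p) for p in preds) / len(preds)


preds = [
    Prediction(True, frozenset({"t1"}), frozenset({"t1"})),   # right answer, right source
    Prediction(True, frozenset({"p4"}), frozenset({"t1"})),   # right answer, wrong source
    Prediction(False, frozenset({"t2"}), frozenset({"t2"})),  # wrong answer, right source
    Prediction(True, frozenset({"f3"}), frozenset({"f3"})),   # right answer, right source
]

print(answer_accuracy(preds))             # 75.0
print(strict_attributed_accuracy(preds))  # 50.0
```

The second prediction is the case the benchmark is built to catch: it counts toward answer accuracy but earns nothing under the strict metric.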
The best performer, Gemini-3.1-Pro-Preview, scored just 76 out of 100. GPT-5.4 often knew the right answer but couldn't show its work: 87.1 for raw answer quality, just 59 once correct citations were required.
Open-source models fared much worse. Qwen3-VL-235B-A22B, the strongest freely available system, managed 22.5 points. Smaller open models mostly landed below 10, making them "extremely risky" for regulated industries, the researchers say.