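A study based on 172B tokens tested how often LLMs fabricate answers in document question-answering settings. Key findings: the best model had a fabrication rate of 1.19% at 32K context; strong models typically fell between 5% and 7%; a mid-tier model fabricated answers about nonexistent facts 25% of the time. When context was extended to 200K, every model fabricated at least 10% of the time. Longer context markedly worsens hallucination. The study indicates that hallucination is not only a retrieval failure: even models that can correctly locate facts tend to over-answer when the fact is absent.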
This study tests how often LLMs invent answers when they should rely only on supplied documents.
Companies often use LLMs to answer questions from documents, and they assume these document-based systems are safer because the model is given the source material.
The study shows that no model fully avoided fabrication: even the best model made up answers 1.19% of the time at 32K context.
For strong models, a more typical best-case rate was around 5% to 7%, while a mid-tier model fabricated answers to about 25% of questions about facts that did not exist in the documents.
Longer context made the problem much worse, and at 200K context every tested model fabricated at least 10% of the time.
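This shows that hallucination is not just a failure to retrieve the right sentence: a model that can find a fact when it is present may still produce an answer when the fact is absent.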
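
To make the headline metric concrete, here is a minimal sketch of how a fabrication rate on unanswerable questions could be computed. It assumes every question asks about a fact that is absent from the supplied documents, so any answer that does not abstain counts as fabricated. The abstention phrases and the function names `is_abstention` and `fabrication_rate` are hypothetical, chosen for illustration; the study's actual scoring method is not described here.

```python
from typing import Iterable

# Hypothetical abstention phrases (an assumption for this sketch,
# not the study's scoring rules).
ABSTAIN_MARKERS = (
    "not stated",
    "does not state",
    "not in the document",
    "cannot be determined",
)


def is_abstention(answer: str) -> bool:
    """Return True if the answer declines to give a fact."""
    text = answer.lower()
    return any(marker in text for marker in ABSTAIN_MARKERS)


def fabrication_rate(answers: Iterable[str]) -> float:
    """Fraction of answers that assert something instead of abstaining.

    Assumes every question was about a fact missing from the documents,
    so each non-abstaining answer is counted as a fabrication.
    """
    answers = list(answers)
    if not answers:
        raise ValueError("need at least one answer")
    fabricated = sum(1 for a in answers if not is_abstention(a))
    return fabricated / len(answers)


# Illustrative model outputs (invented for this example, not study data).
sample_answers = [
    "The report does not state the CEO's salary.",
    "That figure is not in the document.",
    "This cannot be determined from the text.",
    "The CEO's salary was $2.4 million.",  # asserts a fact that is not there
]
rate = fabrication_rate(sample_answers)
```

With these four illustrative answers, one of four asserts a nonexistent fact, so `rate` is 0.25. A real evaluation would likely need a more robust abstention check than substring matching, such as a human or model-based judge.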