# ACL-Verbatim: Hallucination-Free Question Answering for Research

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-20 08:00
- AIHOT score: 41
- AIHOT link: https://aihot.virxact.com/items/cmpwoqq8804t5slsny9gfmyd6
- Original link: https://arxiv.org/abs/2605.21102

## AI Summary

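To address hallucination by large language models (LLMs) in academic research, the authors apply the extractive question answering system VerbatimRAG to papers in the ACL Anthology, mapping user queries directly to verbatim text spans in the retrieved documents. They build a new benchmark dataset in which NLP researchers annotate synthetic user queries, generated by a custom pipeline based on the ScIRGen methodology, paired with paper chunks retrieved by VerbatimRAG. The dataset is used to train and evaluate a range of extractive models. A 150M-parameter ModernBERT token classifier trained on silver supervision from that pipeline achieves the best word-level F1 (53.6), ahead of the strongest evaluated LLM extractor (48.7).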

## Main Text

Academic researchers need efficient and reliable methods for collecting high-quality information from trusted sources, but modern tools for AI-assisted research still suffer from the tendency of Large Language Models (LLMs) to produce factually inaccurate or nonsensical output, commonly referred to as hallucinations. We apply the extractive question answering system VerbatimRAG to research papers in the ACL Anthology, directly mapping user queries to verbatim text spans in retrieved documents. We contribute a novel ground truth dataset for the task of mapping user queries to relevant text spans in research papers, and use it to train and evaluate a variety of extractive models. Human annotation is performed by NLP researchers and is based on synthetic user queries generated using a custom pipeline based on the ScIRGen methodology, paired with chunks of research papers retrieved by VerbatimRAG. On this benchmark, a 150M-parameter ModernBERT token classifier trained on silver supervision from our pipeline achieves the best word-level F1 (53.6), ahead of the strongest evaluated LLM extractor (48.7).
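
The abstract reports word-level F1 without defining it here. As a rough illustration of how such a score can be computed between a predicted span and a gold span, here is a minimal sketch that assumes a SQuAD-style bag-of-words overlap with lowercasing and whitespace tokenization. The function name `word_f1` and these normalization choices are assumptions for illustration, not the authors' evaluation code, and the paper's metric may instead be position-based.

```python
from collections import Counter


def word_f1(predicted: str, gold: str) -> float:
    """Word-overlap F1 between a predicted span and a gold span.

    Assumption: words are lowercased and split on whitespace, and
    overlap is counted as a multiset intersection (SQuAD-style).
    """
    pred_words = predicted.lower().split()
    gold_words = gold.lower().split()

    # Two empty spans agree perfectly; one empty span scores zero.
    if not pred_words and not gold_words:
        return 1.0
    if not pred_words or not gold_words:
        return 0.0

    # Multiset intersection counts each shared word at most as often
    # as it appears in both spans.
    overlap = sum((Counter(pred_words) & Counter(gold_words)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_words)
    recall = overlap / len(gold_words)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    pred = "the model uses attention"
    gold = "the model uses self attention layers"
    print(round(word_f1(pred, gold), 3))
```

This sketch returns a value between 0 and 1, whereas the scores quoted above (53.6 and 48.7) are on a 0 to 100 scale. In the example, all four predicted words appear in the six-word gold span, so precision is 1.0 and recall is 4/6.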
