Limitations of LLM-as-a-Judge in Scientific Novelty Evaluation

2026-06-10 08:00

AI Summary

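The study introduces RQ-Bench, a benchmark that builds author-anchored research questions (RQs) from arXiv papers to test novelty judgments. Under both standalone and comparative LLM judging, LLM judges consistently rate model-generated RQs as highly novel, producing a "novelty mirage"; the preference is even stronger in comparative evaluation. Domain experts reach the opposite conclusion and prefer the author-anchored reference questions. Many generated RQs are narrow or source-bound, a dimension that LLM judges often miss. The contradictory conclusions of LLM judges and human experts raise serious doubts about the reliability of LLM-based assessment of scientific novelty.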

Original abstract

LLMs are increasingly used to generate and judge scientific ideas. This makes novelty evaluation a central problem. Full idea evaluation is difficult because it often requires judging a method, its feasibility, and its empirical promise. We therefore study a cleaner upstream object: the research question (RQ). RQ generation is a prerequisite for scientific ideation, and RQs can be compared against questions pursued in real papers. We introduce RQ-Bench, a benchmark built from recent arXiv papers. For each paper, we reconstruct author-anchored RQs from its cited background, gaps, and contributions. These RQs are not the only valid questions for the same background. They are author-anchored reference points for testing novelty judgments. We evaluate model-generated RQs with standalone LLM judging, comparative LLM judging, and human expert evaluation. LLM judges consistently rate model-generated RQs as highly novel, producing a novelty mirage; in comparative evaluations, this preference becomes even stronger. Domain experts, however, reach the opposite conclusion and prefer the author-anchored reference questions. We further find that many generated RQs are narrow or source-bound, a dimension that LLM judges often miss unless explicitly tested. Overall, the contradictory novelty evaluations between LLM judges and human experts raise a serious concern about the reliability of using LLMs to assess the scientific novelty of research questions.
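The disagreement described in the abstract can be made concrete with a simple preference rate over pairwise comparisons. The sketch below is illustrative only: the verdict labels, the function name `generated_win_rate`, and the toy verdict lists are all assumptions, not data or code from the paper. It computes how often the model-generated RQ is preferred over the author-anchored reference, once for an LLM judge and once for human experts, and reports the gap between the two.

```python
from collections import Counter
from typing import Iterable


def generated_win_rate(verdicts: Iterable[str]) -> float:
    """Share of decided comparisons won by the model-generated RQ.

    Each verdict is one of "generated", "reference", or "tie"
    (hypothetical labels). Ties are excluded from the denominator.
    """
    counts = Counter(verdicts)
    decided = counts["generated"] + counts["reference"]
    if decided == 0:
        raise ValueError("no decided comparisons")
    return counts["generated"] / decided


# Toy, made-up verdicts for illustration only (NOT results from the paper).
llm_judge = ["generated", "generated", "generated", "reference", "tie"]
experts = ["reference", "reference", "generated", "reference", "tie"]

llm_rate = generated_win_rate(llm_judge)
expert_rate = generated_win_rate(experts)

# A large positive gap means the LLM judge favors generated RQs
# much more often than the experts do.
gap = llm_rate - expert_rate
print(f"LLM judge: {llm_rate:.2f}, experts: {expert_rate:.2f}, gap: {gap:.2f}")
```

Excluding ties is one design choice among several; counting a tie as half a win for each side would be an equally reasonable convention, and the paper may use a different measure altogether.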

HuggingFace Daily Papers (community trending papers)


Read the original: arxiv.org
