LLM论文评审的人类对齐性与可博弈性研究
阅读原文· arxiv.org该研究基于2025 ACL Rolling Review (ARR)的论文,实证评估了大语言模型(LLM)生成的论文评审意见。研究发现,LLM评审与人类评审的对齐程度有限,且在不同提示词和模型之间存在显著差异。此外,当作者采用基于LLM评审意见的迭代修改工作流时,可以有效“博弈”LLM评审,使高达35%的论文的总分获得统计意义上的显著提升。
LLM-generated reviews for scientific papers are gaining considerable traction and are even being officially piloted by major conferences. We have to assume that not only reviewers are using LLM-assistance, but also that authors use LLMs to revise their papers before submitting. In this work, we perform empirical experiments on papers from the 2025 ACL Rolling Review (ARR) to evaluate LLM reviews from both the author and the reviewer perspective. First, we identify a limited alignment of LLM reviews with human ones. In the best-case scenario, the alignment is reasonable. However, we also find that LLM-human alignment varies substantially across prompts and models. Finally, we investigate the scenario in which the author uses an iterative draft-revise workflow to improve the submission according to the LLM review. We find that this "gaming" of LLM reviews can be effective in specific scenarios, leading to a statistically significant increase of overall scores for up to 35\% of papers. We publish our code: https://github.com/uhh-hcds/reviewarcade.