# PRISM: A Multi-dimensional Benchmark for Evaluating LLM Peer Reviewers

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-27 08:00
- AIHOT score: 67
- AIHOT link: https://aihot.virxact.com/items/cmpqs64tr07rislno27zvp242
- Original link: https://arxiv.org/abs/2605.26730

## AI Summary

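In response to the pressure that the surge in machine learning submissions places on peer review, the researchers propose the PRISM benchmarking framework. The framework evaluates review quality along four dimensions: depth of analysis, novelty assessment, flaw identification and major-issue prioritization, and multi-dimensional constructiveness. Its method is grounded in argument mining, retrieval-augmented verification, and consensus-based scoring. In tests on reviews from ICLR, ICML, and NeurIPS, PRISM finds that LLMs match or even surpass human reviewers on some individual dimensions (for example, comparable depth of analysis), but no system achieves the balanced performance of human reviewers across all dimensions at once, and each system has its own blind spots. The conclusion is that LLM reviewers are best suited as targeted supplements to human review rather than standalone replacements.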

## Main Text

The rapid growth in submissions to machine learning venues has strained the scientific peer-review system and intensified interest in LLM-based automated peer reviewers. However, how good these systems actually are, especially compared to human reviewers at catching scientific gaps, remains poorly understood. In this work, we introduce PRISM (Peer Review Intelligence via Structured Multi-dimensional assessment), a benchmarking framework that evaluates review quality across four dimensions: Depth of Analysis, Novelty Assessment, Flaw Identification & Major Issues Prioritization, and Multi-dimensional Constructiveness. Unlike most existing evaluations based on surface-level metrics like ROUGE and BLEU, or unconstrained LLM-as-a-judge prompting that conflates fluency with rigor, PRISM grounds each dimension in argument mining, retrieval-augmented verification, and consensus-based scoring. We apply PRISM to benchmark five leading automated reviewer systems and human reviewers on a stratified corpus of reviews from ICLR, ICML, and NeurIPS. The results reveal that LLMs can match or beat human reviewers on individual dimensions: comparable depth of analysis, stronger novelty verification, and highly accurate critique prioritization. However, no single system consistently matches the balanced performance of the human baseline across all dimensions at once. Each exhibits a distinct specialization profile with characteristic blind spots, failure modes that aggregate metrics miss entirely. The implication is that LLM reviewers are best understood as targeted supplements to human review, effective within specific dimensions, but unreliable as standalone replacements. Our demo and key results can be found at https://khanhthanhdev.github.io/prism-page/.
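The claim that aggregate metrics can hide dimension-level blind spots can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the scores are invented (they are not results from the paper), the `blind_spots` helper and its `tolerance` parameter are hypothetical, and the 0-to-1 scale is arbitrary. It is not PRISM's scoring method, only a toy comparison of a per-dimension profile against a mean.

```python
from statistics import mean

# The four PRISM dimensions named in the abstract (identifiers are our own).
DIMENSIONS = (
    "depth_of_analysis",
    "novelty_assessment",
    "flaw_identification_prioritization",
    "constructiveness",
)


def blind_spots(profile, baseline, tolerance=0.1):
    """Hypothetical helper: list the dimensions where `profile` trails
    `baseline` by more than `tolerance`."""
    return [d for d in DIMENSIONS if baseline[d] - profile[d] > tolerance]


# Invented scores for illustration only; NOT taken from the paper.
human = dict(zip(DIMENSIONS, (0.70, 0.70, 0.70, 0.70)))
system = dict(zip(DIMENSIONS, (0.70, 0.90, 0.85, 0.35)))

# The averaged score makes the two reviewers look equivalent...
aggregate_gap = mean(system.values()) - mean(human.values())

# ...while the per-dimension view exposes where the system falls short.
weak_dimensions = blind_spots(system, human)

print(f"aggregate gap: {aggregate_gap:+.2f}")
print(f"blind spots: {weak_dimensions}")
```

With these made-up numbers the two averages coincide, yet the per-dimension comparison flags one dimension where the hypothetical system trails the baseline, which is the kind of gap a single aggregate score would not show.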
