SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?

2026-05-28 08:00

AI Summary

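SoundnessBench is a benchmark of 1,099 machine-learning research proposals for evaluating how well large language models (LLMs) judge the methodological viability of a research idea. Tests on 12 frontier LLMs reveal a pervasive optimism bias: under standard prompting, models frequently misjudge low-soundness proposals as sound, while aggressive prompting shifts the errors from false positives to false negatives. Control experiments suggest that this behavior is not caused by a single confounder. The results indicate that current LLMs are not yet suitable as standalone first-gate evaluators of scientific rigor.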

Original text · Untranslated

Autonomous AI research agents aim to accelerate scientific discovery by automating the research pipeline, from hypothesis generation to peer review. However, existing benchmarks rarely test a fundamental bottleneck: whether Large Language Models can judge the methodological viability of a research idea before expending time and computational resources. We introduce SoundnessBench, a curated benchmark of 1,099 machine-learning research proposals reconstructed from ICLR submissions, labeled with reviewer soundness sub-scores, and audited against source papers. SoundnessBench should be interpreted as a benchmark for recoverable proposal-stage soundness rather than exact prediction of full-paper review outcomes. Across 12 frontier LLMs, we find a pervasive optimism bias: under standard prompting, models frequently rate low-soundness proposals as sound, while aggressive prompting largely shifts errors from false positives to false negatives. Additional controls for public-corpus contamination, paper-identifying phrases, surface features, and human audit quality suggest that this behavior is not explained by a single confounder. Our results indicate that current LLMs are not yet reliable as standalone first-gate evaluators for scientific rigor.
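The shift from false positives to false negatives described above can be made concrete with a small calculation. The following is a minimal sketch, not the authors' evaluation code: it assumes each proposal carries a binary ground-truth label (sound or not) and each model returns a binary verdict, and the function name `error_rates` and the toy data are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ErrorRates:
    false_positive_rate: float  # share of unsound proposals judged sound
    false_negative_rate: float  # share of sound proposals judged unsound


def error_rates(labels, predictions):
    """Compute error rates; in both lists True means 'sound'."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have equal length")
    pairs = list(zip(labels, predictions))
    false_pos = sum(1 for y, p in pairs if not y and p)
    false_neg = sum(1 for y, p in pairs if y and not p)
    negatives = sum(1 for y in labels if not y)
    positives = len(labels) - negatives
    fpr = false_pos / negatives if negatives else 0.0
    fnr = false_neg / positives if positives else 0.0
    return ErrorRates(fpr, fnr)


# Toy data (hypothetical, not taken from the benchmark).
labels = [True, True, False, False, False, True]
standard = [True, True, True, True, False, True]        # optimistic verdicts
aggressive = [False, True, False, False, False, False]  # overly strict verdicts

print(error_rates(labels, standard))
print(error_rates(labels, aggressive))
```

In this toy setup the optimistic verdicts produce only false positives and the strict verdicts produce only false negatives, which is the pattern the abstract describes: the errors move rather than disappear.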

HuggingFace Daily Papers (community trending papers)

Read the original · arxiv.org
