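Yale University and the University of Chicago built a controlled test from 11,683 real papers: LLMs were given each paper's nearby prior work as a starting point and asked to propose a new motivation and method, which were then compared with the actual human ideas. Key finding: the gap is not in idea quality but in idea range. Human ideas spread widely across modes such as explaining mechanisms, testing failures, and measuring evidence. Only 12.1% of human ideas are mainly about connecting different works, whereas for LLMs that share is 47.1% to 64.2% (roughly 4 to 5 times the human rate). Extra reasoning actually reinforced this pattern, suggesting that LLMs tend to polish familiar recipes rather than explore more diverse research moves.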
This Yale and University of Chicago paper shows that the real gap between LLM-generated research ideas and human ones is not idea quality but idea range: LLMs think more narrowly than human researchers.
The researchers built a controlled test from 11,683 real papers, using each paper's nearby prior work as the shared starting point.
They asked models to propose a new motivation and method from those same prior papers, then compared those ideas with the real human paper ideas.
Instead of asking whether any one idea looked novel, they labeled each idea by what gap it noticed and what kind of contribution it made.
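To make the comparison concrete, here is a minimal Python sketch of how per-idea labels could be turned into a distribution over contribution modes. This is not the authors' pipeline: the function names `mode_shares` and `share_ratio`, the label strings, and the toy label lists are all hypothetical. Only the 12.1%, 47.1%, and 64.2% figures come from the summary above.

```python
from collections import Counter


def mode_shares(labels):
    """Return the fraction of ideas that fall under each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {mode: n / total for mode, n in counts.items()}


def share_ratio(llm_share, human_share):
    """How many times larger the LLM share is than the human share."""
    return llm_share / human_share


# Toy labels, invented purely for illustration (not data from the paper).
human_labels = ["explain_mechanism", "test_failure", "measure_evidence", "connect_works"]
llm_labels = ["connect_works", "connect_works", "connect_works", "explain_mechanism"]

human_dist = mode_shares(human_labels)
llm_dist = mode_shares(llm_labels)
print("human:", human_dist)
print("llm:  ", llm_dist)

# Ratios implied by the reported percentages (12.1% human vs 47.1% to 64.2% LLM).
low = share_ratio(47.1, 12.1)
high = share_ratio(64.2, 12.1)
print(f"LLM 'connect works' share is about {low:.1f}x to {high:.1f}x the human share")
```

Comparing whole distributions rather than single ideas is what lets a narrow range show up: in the toy data, one mode takes three of the four LLM ideas while the human ideas are spread evenly across four modes.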