SciPredict: Can LLMs Predict Experimental Outcomes in the Natural Sciences?
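Source: arxiv.org. A research team has released SciPredict, a benchmark of 405 experimental-outcome prediction tasks spanning 33 sub-fields of physics, biology, and chemistry. Evaluations show that mainstream LLMs reach only 14-26% accuracy; some frontier models exceed the roughly 20% achieved by human experts, but all fall far short of what would be needed to guide experiments reliably. More importantly, the models cannot calibrate their confidence: their accuracy stays at about 20% whether or not they are confident, whereas human experts' accuracy rises from about 5% to about 80% as they judge an outcome to be more predictable. The study concludes that superhuman research capability requires not only more accurate predictions but also an accurate sense of how reliable those predictions are.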
Accelerating scientific discovery requires identifying which experiments would yield the best outcomes before committing resources to costly physical validation. While existing benchmarks evaluate LLMs on scientific knowledge and reasoning, their ability to predict experimental outcomes, a task where AI could significantly exceed human capabilities, remains largely underexplored. We introduce SciPredict, a benchmark comprising 405 tasks derived from recent empirical studies in 33 specialized sub-fields of physics, biology, and chemistry. SciPredict addresses two critical questions: (a) can LLMs predict the outcome of scientific experiments with sufficient accuracy? and (b) can such predictions be reliably used in the scientific research process? Evaluations reveal fundamental limitations on both fronts. Model accuracies are 14-26%, and human expert performance is approximately 20%. Although some frontier models exceed human performance, model accuracy is still far below what would enable reliable experimental guidance. Even within this limited performance, models fail to distinguish reliable predictions from unreliable ones, achieving only approximately 20% accuracy regardless of their confidence or whether they judge outcomes as predictable without physical experimentation. Human experts, in contrast, demonstrate strong calibration: their accuracy increases from approximately 5% to approximately 80% as they deem outcomes more predictable without conducting the experiment. SciPredict establishes a rigorous framework demonstrating that superhuman performance in experimental science requires not just better predictions, but better awareness of prediction reliability. For reproducibility, all our data and code are provided at https://github.com/scaleapi/scipredict
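The calibration finding is about how accuracy changes with stated confidence, and a small sketch can make that concrete. The Python below is a minimal illustration, not the SciPredict evaluation code: the function name `accuracy_by_confidence`, the (confidence, is_correct) record format, and the toy records are all assumptions made for this example. It groups predictions into equal-width confidence bins and reports the accuracy in each bin.

```python
from collections import defaultdict


def accuracy_by_confidence(records, num_bins=5):
    """Group (confidence, is_correct) pairs into equal-width confidence bins.

    `confidence` is assumed to lie in [0, 1]. Returns a dict mapping each
    non-empty bin index to a tuple (accuracy, count).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for confidence, is_correct in records:
        # Clamp so that confidence == 1.0 falls into the last bin.
        idx = min(int(confidence * num_bins), num_bins - 1)
        totals[idx] += 1
        hits[idx] += int(is_correct)
    return {idx: (hits[idx] / totals[idx], totals[idx]) for idx in sorted(totals)}


# Hypothetical toy records, invented purely for illustration.
toy_records = [
    (0.10, False),
    (0.15, False),
    (0.90, True),
    (0.95, True),
    (1.00, False),
]

if __name__ == "__main__":
    for idx, (acc, count) in accuracy_by_confidence(toy_records).items():
        print(f"bin {idx}: accuracy={acc:.2f} over {count} predictions")
```

For a well-calibrated predictor, accuracy rises across the bins, which is the pattern the abstract reports for human experts. For a predictor like the evaluated models, accuracy would stay roughly flat across bins.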