# Evaluating AI's Ability to Forecast Scientific Progress: The CUSP Benchmark Study

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-21 08:00
- AIHOT score: 59
- AIHOT link: https://aihot.virxact.com/items/cmph6m46p0lrnsljwxi0c7bhq
- Original link: https://arxiv.org/abs/2605.22681

## AI Summary

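This study introduces the CUSP benchmark, which evaluates AI's ability to forecast scientific progress across 4,760 scientific events. Testing shows that current frontier models have systematic limitations: they can identify plausible directions among candidates, but they cannot reliably predict whether an advance will be realized, and they often misestimate its timing. Performance varies markedly across domains, with AI progress easier to predict than progress in other disciplines. Performance is largely insensitive to whether an event falls before or after the training cutoff, indicating that the limitations do not stem solely from training knowledge. Adding pre-event knowledge improves performance but does not reach the full-information setting. Models also exhibit overconfidence and response biases. Overall, current AI is not yet mature as a tool for forecasting scientific progress.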

## Main Text

Artificial intelligence (AI) is increasingly embedded in scientific discovery, yet whether it can anticipate scientific progress remains unclear. To study this question, we introduce a temporally grounded evaluation framework for forecasting scientific progress under controlled knowledge constraints. We present CUSP (Cutoff-conditioned Unseen Scientific Progress), a multi-disciplinary and event-level benchmark that evaluates scientific forecasting in AI systems through feasibility assessment, mechanistic reasoning, generative solution design, and temporal prediction. Across 4,760 scientific events, we observe systematic and domain-dependent limitations in current frontier models. While models can identify plausible research directions from competing candidates, they fail to reliably predict whether scientific advances will be realized and systematically misestimate when they will occur. Performance is highly heterogeneous across domains, with the timing of AI progress more predictable than that of advances in biology, chemistry, and physics. Performance is largely insensitive to whether events occur before or after the training cutoff, suggesting these limitations cannot be explained solely by knowledge exposure in training data. Under controlled information access, additional pre-cutoff knowledge improves performance but does not close the gap to full-information settings, and this gap becomes more pronounced for high-citation advances. Models also exhibit systematic overconfidence and strong response biases, indicating unreliable uncertainty estimation. Taken together, current AI systems fall short as predictive tools for scientific progress. Access to prior knowledge does not translate into reliable forecasting, and performance benefits more from post-event information than from forward-looking prediction.
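
The abstract mentions two measurable ideas: comparing performance on events before and after a training cutoff, and systematic overconfidence. The following is a minimal sketch of how such quantities could be computed for binary feasibility forecasts. It is not the CUSP implementation; the `Forecast` record, its field names, the 0.5 decision threshold, and the choice of metrics (accuracy, mean confidence, Brier score) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Forecast:
    """Hypothetical record for one binary feasibility forecast."""
    event_date: date    # when the scientific event occurred
    p_realized: float   # model's stated probability that the advance is realized
    realized: bool      # ground-truth outcome


def split_by_cutoff(forecasts, cutoff):
    """Partition forecasts into events before and on/after a training cutoff."""
    before = [f for f in forecasts if f.event_date < cutoff]
    after = [f for f in forecasts if f.event_date >= cutoff]
    return before, after


def calibration_summary(forecasts):
    """Return accuracy, mean confidence, their gap, and the Brier score.

    Confidence is taken as the probability assigned to the predicted label,
    so a positive gap (confidence minus accuracy) indicates overconfidence.
    """
    n = len(forecasts)
    if n == 0:
        raise ValueError("need at least one forecast")
    mean_conf = sum(max(f.p_realized, 1.0 - f.p_realized) for f in forecasts) / n
    accuracy = sum((f.p_realized >= 0.5) == f.realized for f in forecasts) / n
    brier = sum((f.p_realized - float(f.realized)) ** 2 for f in forecasts) / n
    return {
        "accuracy": accuracy,
        "mean_confidence": mean_conf,
        "overconfidence_gap": mean_conf - accuracy,
        "brier": brier,
    }


# Made-up toy data, purely to exercise the functions above.
toy = [
    Forecast(date(2023, 3, 1), 0.9, True),
    Forecast(date(2023, 9, 1), 0.9, False),
    Forecast(date(2024, 6, 1), 0.2, False),
    Forecast(date(2024, 11, 1), 0.8, False),
]
pre, post = split_by_cutoff(toy, cutoff=date(2024, 1, 1))
summary = calibration_summary(toy)
```

Running `calibration_summary` separately on `pre` and `post` gives one simple way to check whether scores change across the cutoff. Under these assumptions, similar scores on both sides would match the cutoff-insensitivity the abstract describes.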
