ForeSci: Evaluating Forward-Looking AI Research Judgement in LLM Agents

2026-06-04 08:00

AI Summary

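ForeSci is a temporally controlled benchmark for evaluating the forward-looking research judgement of LLM agents. It contains 500 tasks spanning four fast-moving AI domains and four decision families. Each task is paired with a cutoff-aligned offline knowledge base; answer-generation backbones are chosen to precede the task cutoffs, and post-cutoff papers are used only for validation. The evaluation covers native LLMs, Hybrid RAG, and three research-agent adaptations across four backbone models. Results show that explicit evidence organization improves traceability and factual support, but the gains vary by decision family. Diagnostics reveal a decoupling between evidence and decision: agents may cite relevant evidence yet forecast the wrong research object. The benchmark turns forward-looking AI research judgement into a controlled setting for evaluating research agents as decision-making systems.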

Original text · untranslated

AI research often requires decisions before future evidence exists: which bottleneck to attack, which direction to pursue, or where a project should be positioned. We introduce ForeSci, a temporally controlled benchmark for evaluating whether LLM agents can make such forward-looking research judgements from historical evidence. ForeSci contains 500 tasks across four fast-moving AI domains and four decision families. Each task is paired with a cutoff-aligned offline knowledge base; post-cutoff papers are hidden during generation and used only for validation. To avoid random future-event prediction, tasks are derived from pre-cutoff taxonomy branches and evidence signals, and answer-generation backbones are selected to precede the task cutoffs. We evaluate native LLMs, Hybrid RAG, and three research-agent adaptations across four backbones. Results show that explicit evidence organization improves traceability and factual support, but gains depend strongly on the decision family. Diagnostics reveal a recurring evidence-decision decoupling: agents may cite relevant evidence while forecasting the wrong research object. ForeSci turns forward-looking AI research judgement into a controlled benchmark for evaluating research agents as decision-making systems.
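
The temporal control described in the abstract can be illustrated with a minimal sketch. It is not the authors' code. The `Paper` record, the function names, the inclusive treatment of the cutoff date, and the use of a backbone "knowledge cutoff" date are all assumptions made for illustration; the abstract only states that post-cutoff papers are hidden during generation and that backbones precede the task cutoffs.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Paper:
    """Hypothetical paper record; field names are illustrative only."""
    paper_id: str
    published: date


def split_by_cutoff(papers, cutoff):
    """Split papers into a pre-cutoff knowledge base and a post-cutoff pool.

    Assumption: papers published on the cutoff date itself count as
    pre-cutoff. The pool is held out and would only be used for validation.
    """
    knowledge_base = [p for p in papers if p.published <= cutoff]
    validation_pool = [p for p in papers if p.published > cutoff]
    return knowledge_base, validation_pool


def backbone_is_eligible(backbone_knowledge_cutoff, task_cutoff):
    """Return True if the backbone strictly precedes the task cutoff.

    Assumption: "precedes" is modelled by comparing a knowledge-cutoff date.
    """
    return backbone_knowledge_cutoff < task_cutoff
```

Under these assumptions, an agent would retrieve only from `knowledge_base` while answering a task, and `validation_pool` would stay out of reach until scoring.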

HuggingFace Daily Papers (community trending papers)

Read the original: arxiv.org
