DelveAgent and PhySciBench: A Multi-Agent Framework and Comprehensive Benchmark for Deep Research in the Physical Sciences

2026-06-17 08:00 · 16 days ago

AI Summary

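PhySciBench is a benchmark for physical science research, comprising 200 expert-curated physics and chemistry questions across six categories of real-world research tasks. Evaluations show that even the strongest baseline, Gemini Deep Research, reaches an accuracy of only 33.5%. Failure cases expose deficiencies such as fragile long reasoning chains, limited knowledge transfer across steps, and a lack of physics-grounded self-verification. DelveAgent, proposed in response, is a modular multi-agent framework equipped with an adaptive planning loop, dual-granularity memory, and a hierarchical physics-grounded reflection mechanism. Across four scientific benchmarks, DelveAgent improves accuracy by up to 7.5 percentage points and reduces inference cost to roughly one-third of that of the strongest baseline.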

Original abstract · untranslated

Deep research agents are Large Language Model (LLM)-based systems designed for autonomous, multi-step scientific reasoning, and they hold immense potential for accelerating research in the physical sciences. However, comprehensive and in-depth evaluations of their capabilities within this domain remain lacking. To address this gap, we introduce PhySciBench, a benchmark highly relevant to physical science research, comprising 200 expert-curated questions, balanced between physics and chemistry, across six task categories that reflect real-world scientific workflows. Evaluations of state-of-the-art models and agent systems on PhySciBench reveal limited performance; even the strongest baseline, Gemini Deep Research, achieves an accuracy of only 33.5%. Analysis of failure cases identifies three recurrent deficiencies: fragility in extended reasoning chains, limited knowledge transfer across steps, and a lack of physics-grounded self-verification. Motivated by these findings, we develop DelveAgent, a modular multi-agent framework equipped with an adaptive planning loop, dual-granularity memory, and a hierarchical physics-grounded reflection mechanism. Across four scientific benchmarks, DelveAgent improves accuracy by up to 7.5 percentage points while reducing inference costs to approximately one-third of the strongest baseline. These results establish the significance of PhySciBench as a critical benchmark for evaluating AI systems in the physical sciences and demonstrate that architectural specialization can effectively enhance the reliability of autonomous scientific research.

HuggingFace Daily Papers (community trending papers)

Read the original · arxiv.org
