# DelveAgent and PhySciBench: A Multi-Agent Framework and Comprehensive Benchmark for Deep Research in the Physical Sciences

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-17 08:00
- AIHOT score: 51
- AIHOT link: https://aihot.virxact.com/items/cmqq0ff7g05h3slp50p03q7m7
- Original link: https://arxiv.org/abs/2606.18648

## AI Summary

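PhySciBench is a benchmark for physical science research, comprising 200 expert-curated physics and chemistry questions that cover six categories of real-world research tasks. Evaluations show that the strongest baseline, Gemini Deep Research, reaches an accuracy of only 33.5%. Failure cases expose three weaknesses: fragile long reasoning chains, limited knowledge transfer across steps, and a lack of physics-grounded self-verification. The proposed response, DelveAgent, is a modular multi-agent framework equipped with an adaptive planning loop, dual-granularity memory, and a hierarchical physics-grounded reflection mechanism. Across four scientific benchmarks, DelveAgent improves accuracy by up to 7.5 percentage points while reducing inference cost to roughly one-third of the strongest baseline.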

## Main Text

Deep research agents are Large Language Model (LLM)-based systems designed for autonomous, multi-step scientific reasoning, and they hold immense potential for accelerating research in the physical sciences. However, comprehensive and in-depth evaluations of their capabilities within this domain remain lacking. To address this gap, we introduce PhySciBench, a benchmark highly relevant to physical science research, comprising 200 expert-curated questions, balanced between physics and chemistry, across six task categories that reflect real-world scientific workflows. Evaluations of state-of-the-art models and agent systems on PhySciBench reveal limited performance; even the strongest baseline, Gemini Deep Research, achieves an accuracy of only 33.5%. Analysis of failure cases identifies three recurrent deficiencies: fragility in extended reasoning chains, limited knowledge transfer across steps, and a lack of physics-grounded self-verification. Motivated by these findings, we develop DelveAgent, a modular multi-agent framework equipped with an adaptive planning loop, dual-granularity memory, and a hierarchical physics-grounded reflection mechanism. Across four scientific benchmarks, DelveAgent improves accuracy by up to 7.5 percentage points while reducing inference costs to approximately one-third of the strongest baseline. These results establish the significance of PhySciBench as a critical benchmark for evaluating AI systems in the physical sciences and demonstrate that architectural specialization can effectively enhance the reliability of autonomous scientific research.
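To make the reported figures concrete, the short sketch below converts the 33.5% accuracy on 200 questions into a count of correct answers, and shows how a gain in percentage points differs from a relative gain. The function names are hypothetical, and the 40.0% baseline in the second calculation is an illustrative assumption only: the abstract does not state which benchmark or baseline score the 7.5-point improvement refers to.

```python
from fractions import Fraction


def correct_count(accuracy_percent: str, total: int) -> Fraction:
    """Number of correct answers implied by an accuracy given in percent.

    Fractions are used so that values such as 33.5 are handled exactly.
    """
    return Fraction(accuracy_percent) / 100 * total


def relative_gain(baseline_percent: str, gain_points: str) -> Fraction:
    """Relative improvement (as a fraction of the baseline) for a gain
    expressed in percentage points."""
    return Fraction(gain_points) / Fraction(baseline_percent)


# Figures from the abstract: 33.5% accuracy on 200 questions.
print(correct_count("33.5", 200))

# Hypothetical baseline of 40.0% (an assumption, not a reported number):
# a 7.5-point gain moves the score to 47.5%, a relative gain of 18.75%.
print(float(relative_gain("40.0", "7.5")) * 100)
```

Under these assumptions, the strongest baseline answers 67 of the 200 PhySciBench questions correctly. The same 7.5-point gain would correspond to a larger relative improvement on a benchmark with a lower baseline score.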
