# AutoMedBench: Benchmarking Agentic AI Models for Autonomous Medical Research

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-01 08:00
- AIHOT score: 64
- AIHOT link: https://aihot.virxact.com/items/cmpxn3gho05ibslck3192rsd9
- Original link: https://arxiv.org/abs/2606.01961

## AI Summary

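AutoMedBench is a workflow-aware benchmark for evaluating autonomous medical-AI research agents across the full research process. It covers medical imaging and multimodal inference tasks and organizes agent execution into a unified five-stage workflow: Plan, Setup, Validate, Inference, and Submit. Tasks span five tracks: segmentation, image enhancement, visual question answering, report generation, and lesion detection. Each task has two difficulty tiers, Lite and Standard, and a single run averages 33 agent turns. The results indicate that Validate is the weakest stage for current agents, while Setup is the strongest. Error analysis shows that verification and submission failures account for 37.7% and 38.1% of errors respectively, while task-understanding errors account for only 0.9%; runs with an error code score 48% lower overall, on average, than runs without errors.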

## Main Text

Autonomous agents are increasingly expected to support end-to-end medical-AI research workflows, moving beyond isolated prediction tasks or short-form clinical question answering. However, existing medical agent benchmarks primarily evaluate final outputs, providing limited visibility into agent behavior within the research process. To address this gap, we present AutoMedBench, a workflow-aware benchmark for autonomous medical-AI research across diverse medical imaging and multimodal inference tasks, organizing agent execution into a unified five-stage workflow (S1-S5): Plan, Setup, Validate, Inference, and Submit. It comprises long-horizon tasks with each run averaging 33 agent turns, spanning five research tracks: segmentation, image enhancement, visual question answering (VQA), report generation, and lesion detection. Each task is evaluated under two difficulty tiers, Lite and Standard, which use the same data and metrics but differ in the amount of task-brief scaffolding, and each run is scored using both final task performance and S1-S5 stage scores, enabling stage-level analysis from the initial task brief to the final submitted artifact. Across thousands of recorded runs, stage-level scoring reveals that Validate is the weakest workflow stage on average, whereas Setup is the strongest, suggesting that current agents are better at making pipelines executable than at verifying their reliability. Post-run error analysis further shows that verification and submission failures dominate tagged errors, accounting for 37.7% and 38.1% of fired codes respectively, whereas task-understanding errors are rare at 0.9%, and runs with one fired error code have a 48% lower overall score than runs with no error code on average.
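The stage-level analysis described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `RunRecord` schema, the stage keys, the `[0, 1]` score range, and the helper names are hypothetical and are not taken from the AutoMedBench code or paper. The sketch averages S1-S5 scores across runs to find the weakest and strongest stages, and computes the relative gap in overall score between runs with at least one fired error code and runs with none.

```python
from dataclasses import dataclass, field
from statistics import mean

# Hypothetical keys for the five workflow stages (S1-S5) named in the abstract.
STAGES = ("S1_plan", "S2_setup", "S3_validate", "S4_inference", "S5_submit")


@dataclass
class RunRecord:
    """One recorded agent run. Illustrative schema, not the benchmark's own."""
    track: str                 # e.g. "segmentation", "vqa"
    tier: str                  # "Lite" or "Standard"
    stage_scores: dict         # stage key -> score, assumed to lie in [0, 1]
    overall: float             # overall run score, assumed to lie in [0, 1]
    error_codes: list = field(default_factory=list)  # fired error codes, if any


def mean_stage_scores(runs):
    """Average each stage score over all runs."""
    return {s: mean(r.stage_scores[s] for r in runs) for s in STAGES}


def weakest_and_strongest(runs):
    """Return (weakest_stage, strongest_stage) by mean stage score."""
    means = mean_stage_scores(runs)
    return min(means, key=means.get), max(means, key=means.get)


def relative_score_drop(runs):
    """Relative gap in mean overall score: runs with errors vs. runs without.

    Returns None when either group is empty or the clean mean is zero.
    """
    clean = [r.overall for r in runs if not r.error_codes]
    flagged = [r.overall for r in runs if r.error_codes]
    if not clean or not flagged:
        return None
    clean_mean = mean(clean)
    if clean_mean == 0:
        return None
    return (clean_mean - mean(flagged)) / clean_mean


if __name__ == "__main__":
    # Toy values invented for the demo; they are not benchmark results.
    demo = [
        RunRecord("segmentation", "Lite",
                  dict(zip(STAGES, (0.7, 0.9, 0.3, 0.6, 0.5))), 0.8),
        RunRecord("vqa", "Standard",
                  dict(zip(STAGES, (0.6, 0.8, 0.2, 0.5, 0.4))), 0.4,
                  error_codes=["E_SUBMIT"]),
    ]
    print(weakest_and_strongest(demo))
    print(relative_score_drop(demo))
```

Storing per-stage scores next to the overall score is what makes this kind of breakdown possible: a benchmark that records only the final metric could not separate a pipeline that fails to run from one that runs but is never verified.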
