# MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-04-07 08:00
- AIHOT link: https://aihot.virxact.com/items/cmo84qcqu03wfslml3x3mxdap
- Original link: https://arxiv.org/abs/2604.06505

## AI Summary

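The research team has released MedConclusion, a benchmark dataset of 5.7 million PubMed structured abstracts, for testing whether large language models can infer scientific conclusions from structured biomedical evidence. The dataset pairs the non-conclusion sections of each abstract with the author-written conclusion, which provides a natural supervision signal, and it includes metadata such as journal category and SJR to support subgroup analysis. An initial evaluation shows that conclusion writing behaves markedly differently from summary writing, that current automatic metrics struggle to separate strong models, and that the identity of the LLM judge significantly affects scores.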

## Main Text

Large language models (LLMs) are widely explored for reasoning-intensive research tasks, yet resources for testing whether they can infer scientific conclusions from structured biomedical evidence remain limited. We introduce MedConclusion, a large-scale dataset of 5.7M PubMed structured abstracts for biomedical conclusion generation. Each instance pairs the non-conclusion sections of an abstract with the original author-written conclusion, providing naturally occurring supervision for evidence-to-conclusion reasoning. MedConclusion also includes journal-level metadata such as biomedical category and SJR, enabling subgroup analysis across biomedical domains. As an initial study, we evaluate diverse LLMs under conclusion and summary prompting settings and score outputs with both reference-based metrics and LLM-as-a-judge. We find that conclusion writing is behaviorally distinct from summary writing, strong models remain closely clustered under current automatic metrics, and judge identity can substantially shift absolute scores. MedConclusion provides a reusable data resource for studying scientific evidence-to-conclusion reasoning. Our code and data are available at: https://github.com/Harvard-AI-and-Robotics-Lab/MedConclusion.
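To make the instance construction concrete, here is a minimal Python sketch of how a structured abstract could be split into an evidence/conclusion pair and how a generated conclusion could be scored against the reference. It is an illustration under stated assumptions, not the authors' pipeline: the names `Instance`, `build_instance`, `unigram_f1`, and the `CONCLUSION_LABELS` set are hypothetical, and the token-overlap F1 is a simple stand-in for the reference-based metrics mentioned above, which the abstract does not name.

```python
from collections import Counter
from dataclasses import dataclass

# Assumption: section labels treated as the conclusion. The dataset's
# actual label handling may differ.
CONCLUSION_LABELS = {"CONCLUSION", "CONCLUSIONS"}


@dataclass
class Instance:
    evidence: str    # non-conclusion sections, concatenated in order
    conclusion: str  # author-written conclusion, used as the reference


def build_instance(sections):
    """Split (label, text) pairs into an evidence/conclusion instance.

    Returns None when the abstract lacks either a conclusion section
    or any non-conclusion section.
    """
    evidence_parts = []
    conclusion_parts = []
    for label, text in sections:
        norm = label.strip().upper()
        if norm in CONCLUSION_LABELS:
            conclusion_parts.append(text.strip())
        else:
            evidence_parts.append(f"{norm}: {text.strip()}")
    if not evidence_parts or not conclusion_parts:
        return None
    return Instance("\n".join(evidence_parts), " ".join(conclusion_parts))


def unigram_f1(candidate, reference):
    """Whitespace-token overlap F1 between a candidate and a reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped token matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    abstract = [
        ("Background", "X is common."),
        ("Methods", "We ran a trial."),
        ("Results", "Y improved."),
        ("Conclusions", "X helps Y."),
    ]
    inst = build_instance(abstract)
    print(inst.evidence)
    print(unigram_f1("X helps", inst.conclusion))
```

In this sketch a model would be prompted with `inst.evidence` and its output compared to `inst.conclusion`. An LLM-as-a-judge score would be computed separately and, as the abstract notes, can shift substantially with the choice of judge.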
