LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis

2026-06-11 08:00 · 22 days ago

AI Summary

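LingxiDiagBench is a multi-agent benchmark framework built on the LingxiDiag-16K dataset (16,000 EMR-aligned synthetic consultation dialogues covering 12 ICD-10 psychiatric categories). It evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. The experiments find that LLMs reach up to 92.3% accuracy on binary depression-anxiety classification, but only 43.0% on depression-anxiety comorbidity recognition and 28.5% on 12-way differential diagnosis. Dynamic consultation often underperforms static evaluation, which indicates that inadequate information-gathering strategies impair diagnostic quality. Consultation quality as assessed by LLM-as-a-Judge is only moderately correlated with diagnostic accuracy. The dataset and framework are open source.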

Original text · Untranslated

Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician-verified diagnostic labels, and support for dynamic multi-turn consultation. We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 EMR-aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Through extensive experiments across state-of-the-art LLMs, we establish key findings: (1) although LLMs achieve high accuracy on binary depression–anxiety classification (up to 92.3%), performance deteriorates substantially for depression–anxiety comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information-gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality assessed by LLM-as-a-Judge shows only moderate correlation with diagnostic accuracy, suggesting that well-structured questioning alone does not ensure correct diagnostic decisions. We release LingxiDiag-16K and the full evaluation framework to support reproducible research at https://github.com/Lingxi-mental-health/LingxiDiagBench.
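The two kinds of numbers reported in the abstract, diagnostic accuracy and the correlation between judged consultation quality and diagnostic correctness, could be computed roughly as in the sketch below. This is only an illustration under stated assumptions, not the benchmark's actual scoring code: the function names `accuracy` and `pearson`, the toy labels, and the judge scores are all hypothetical, and the abstract does not say which correlation coefficient the authors use (Pearson is assumed here).

```python
from math import sqrt


def accuracy(predicted, gold):
    """Fraction of cases where the predicted label equals the gold label."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: ICD-10 style codes used purely for illustration;
# they are not taken from LingxiDiag-16K.
gold = ["F32", "F41", "F32", "F20"]
pred = ["F32", "F41", "F41", "F20"]

# Hypothetical LLM-as-a-Judge quality scores, one per consultation.
judge_scores = [4.5, 4.0, 4.2, 3.0]

# Per-case correctness as 0/1, so it can be correlated with judge scores.
correct = [1.0 if p == g else 0.0 for p, g in zip(pred, gold)]

acc = accuracy(pred, gold)           # 3 of 4 labels match
r = pearson(judge_scores, correct)   # quality vs. correctness

print(f"accuracy = {acc:.2f}, judge/correctness correlation = {r:.2f}")
```

A coefficient well below 1 on real data would match the abstract's observation that a well-judged consultation does not guarantee a correct diagnosis.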

HuggingFace Daily Papers (trending community papers)

Read the original · arxiv.org
