MedMisBench: Evaluating the Epistemic Resilience of Large Language Models under Misleading Medical Context
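Source: arxiv.org
Large language models have reached expert-level performance on medical exams, but the MedMisBench benchmark reveals a structural fragility: under misleading context, mean model accuracy drops from 71.1% on the original questions to 38.0%, with an attack success rate of 51.5%. MedMisBench contains 10,932 medical questions and 48,889 misleading context-option pairs, covering medical reasoning, agentic capability, and patient-journey evaluation. The most effective attacks are authority-framed falsehoods (69.5%) and exception-poisoning claims (64.1%). A panel of 14 clinicians from 7 countries judged 38.2% of the reviewed cases to carry serious potential harm.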
Large language models (LLMs) now reach expert-level scores on medical licensing exams, encouraging the assumption that high scores imply safe medical judgment, even as patients increasingly turn to LLMs for health advice. We show this assumption is fragile: when misleading context is injected into questions that LLMs originally answer correctly, they abandon the correct answer. We call the ability to maintain correct judgment under adversarial context epistemic resilience, and introduce MedMisBench to measure it. MedMisBench contains 10,932 medical question items and 48,889 misleading context-option pairs spanning medical reasoning, agentic capability, and patient-journey evaluation. Across 11 model configurations, mean accuracy falls from 71.1% on original questions to 38.0% under focused misleading context, with 51.5% attack success. The most damaging injections are formal, rule-like fabrications: authority-framed falsehoods reach 69.5% attack success and exception-poisoning claims reach 64.1%. A 14-member clinical panel from 7 countries identified serious potential harm in 38.2% of reviewed cases. MedMisBench exposes a structural blind spot in LLM evaluation in medical settings: existing benchmarks measure what models know, but not whether they preserve correct medical judgment under misleading context.
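To make the two headline metrics concrete, here is a minimal Python sketch of how accuracy and attack success could be computed from per-trial outcomes. It rests on an assumption: attack success is taken as the fraction of context-option pairs, among questions the model originally answered correctly, on which the model no longer gives the correct answer. The abstract does not spell out the exact formula, so the benchmark's own definition may differ. The `Trial` class, its field names, and the toy data are hypothetical and are not taken from MedMisBench.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Trial:
    """One (question, misleading context) pair. Field names are hypothetical."""
    original_correct: bool  # model was correct on the clean question
    attacked_correct: bool  # model was correct once misleading context was added


def accuracy(flags: Iterable[bool]) -> float:
    """Fraction of True values; 0.0 for an empty input."""
    values = list(flags)
    return sum(values) / len(values) if values else 0.0


def attack_success_rate(trials: List[Trial]) -> float:
    """Assumed definition: among trials the model originally got right,
    the share where the misleading context made it answer incorrectly."""
    eligible = [t for t in trials if t.original_correct]
    if not eligible:
        return 0.0
    flipped = sum(1 for t in eligible if not t.attacked_correct)
    return flipped / len(eligible)


# Made-up toy data for illustration only, not benchmark results.
trials = [
    Trial(original_correct=True, attacked_correct=True),
    Trial(original_correct=True, attacked_correct=False),
    Trial(original_correct=True, attacked_correct=False),
    Trial(original_correct=False, attacked_correct=False),
]

clean_acc = accuracy(t.original_correct for t in trials)
attacked_acc = accuracy(t.attacked_correct for t in trials)
asr = attack_success_rate(trials)
print(f"clean={clean_acc:.2f} attacked={attacked_acc:.2f} attack_success={asr:.2f}")
```

Conditioning on originally correct answers keeps the attack metric separate from baseline knowledge: a model that never knew the answer cannot be said to have abandoned it.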