Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models
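Source: arxiv.org
Existing chart question-answering benchmarks have a limitation: models may answer by relying on shortcuts or background knowledge rather than visual reasoning. To evaluate visual reasoning rigorously, the study proposes "counterfactual charts", which keep the chart-question task fixed while changing the underlying chart and its answer. To this end, it introduces the Chartographer framework, which reverse engineers charts into executable code, validates reconstruction fidelity, generates seed-controlled variants, and derives new answers from executable QA logic. Applying the framework to existing datasets, the study evaluates the variation sensitivity and generalization of proprietary and open-source vision-language models. The results show that counterfactual charts reveal failures hidden by single-chart testing: after answering the original chart correctly, models often fail to generalize when the updated chart requires a new visual reasoning pathway.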
Chart question-answering (QA) benchmarks aim to pose questions that require visual reasoning to answer correctly, but models can often reach solutions through shortcuts or through prior familiarity with a chart drawn from their own background knowledge. To strictly evaluate visual reasoning, we propose counterfactual charts, in which the chart-question task remains fixed but the underlying chart and the corresponding answer are varied. We introduce Chartographer, a framework to reverse engineer charts into executable code, validate reconstruction fidelity, generate seed-controlled counterfactual variants, and derive new answers from executable QA logic. We apply this framework to existing chart QA datasets and evaluate proprietary and open-source vision-language models (VLMs), measuring variation sensitivity and generalizability. Counterfactual charts reveal failures hidden by single-chart performance: VLMs often fail to generalize after answering the original chart correctly. We find that failures are most prevalent when updated charts require novel visual reasoning pathways.
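To make the idea of seed-controlled counterfactual variants with executable QA logic concrete, the following is a minimal Python sketch. It is not the Chartographer implementation: `ChartSpec`, `make_variant`, and `answer_highest_category` are hypothetical names, the value range and the example question are assumptions, and rendering the chart image is omitted. The sketch only illustrates that the question stays fixed while the data and the derived answer change with the seed.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ChartSpec:
    """Hypothetical stand-in for a chart recovered as executable code."""
    title: str
    categories: tuple
    values: tuple


def make_variant(spec: ChartSpec, seed: int) -> ChartSpec:
    """Resample the values with a seeded RNG; the same seed gives the same variant."""
    rng = random.Random(seed)
    # Assumed value range (10 to 100), chosen only for illustration.
    new_values = tuple(round(rng.uniform(10, 100), 1) for _ in spec.values)
    return ChartSpec(spec.title, spec.categories, new_values)


def answer_highest_category(spec: ChartSpec) -> str:
    """Executable QA logic for the fixed question
    'Which category has the highest value?'."""
    best = max(range(len(spec.values)), key=lambda i: spec.values[i])
    return spec.categories[best]


# Illustrative data, not taken from any dataset.
original = ChartSpec(
    "Units sold by region",
    ("North", "South", "East", "West"),
    (42.0, 57.5, 31.0, 48.2),
)

# The question is unchanged; each variant's answer is recomputed from its data.
variants = [make_variant(original, seed) for seed in range(3)]
answers = [answer_highest_category(v) for v in variants]
```

Because the answer is computed from the variant's data instead of being copied from the original chart, a model that memorized the original answer gains no advantage on the variants.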