# GridVQA-X: A Framework for Evaluating Multimodal Explainability Methods

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-02 08:00
- AIHOT score: 50
- AIHOT link: https://aihot.virxact.com/items/cmqwqyy1k0073slikxgca9xbw
- Original paper: https://arxiv.org/abs/2606.14740

## AI Summary

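GridVQA-X is the first diagnostic framework designed specifically to evaluate cross-modal explainability. It uses a closed-world synthesis logic to generate mathematically guaranteed explanations, and it trains paired control models on identical architectures: M_{pure}, which learns robust spatial-relational reasoning, and M_{spur}, which is forced to rely on cross-modal shortcuts. The experiments show that widely used explainability methods cannot tell these two models apart. They fail to capture genuine cross-modal synergy and may misrepresent how multimodal models actually make decisions, which exposes a critical gap in how faithfully current multimodal explainability methods capture cross-modal reasoning.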

## Main Text

As Vision-Language Models advance, it becomes imperative that their predictions be readily explainable to relevant stakeholders. However, the field of explainability has not kept pace with the multimodal surge. While recent Multimodal Explainable AI (MxAI) methods generate explanations that attribute the interaction between different modalities, current evaluation protocols lack the ground truth required to distinguish between true cross-modal reasoning (e.g., spatial composition) and shallow cross-modal shortcuts (e.g., Bag-of-Words attribute matching). It remains unknown whether MxAI methods faithfully capture synergistic interactions or merely hallucinate reasoning on models that act as simple feature detectors. In this paper, we introduce GridVQA-X, the first diagnostic framework specifically designed to evaluate cross-modal explainability. Unlike natural datasets, GridVQA-X leverages a closed-world synthesis logic to generate unique, mathematically guaranteed explanations. We use this controlled environment to train paired ground-truth models on identical architectures: M_{pure}, which learns robust spatial-relational reasoning, and M_{spur}, which is structurally forced to rely on cross-modal shortcuts. This behavioral divergence creates a rigorous testbed: a faithful explainer must report distinct reasoning pathways for each model. Our findings reveal that widely used methods fail to distinguish between models relying on genuine spatial-relational reasoning and those exploiting cross-modal shortcuts. This highlights a critical gap in capturing true cross-modal synergy and shows that these methods can misrepresent how multimodal models actually make decisions.
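
The behavioral divergence between a spatial-relational reasoner and a shortcut-based matcher can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the `Obj` scene representation, the "left of" question, and the two hand-written rules `pure_answer` and `spur_answer` are hypothetical stand-ins, not the paper's data generator or its trained M_{pure} and M_{spur} models. The sketch only shows why a closed-world scene gives a checkable ground truth: swapping the two queried objects must flip the answer of a spatial reasoner but leaves a Bag-of-Words matcher unchanged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    """One object in a hypothetical grid scene (illustrative only)."""
    color: str
    shape: str
    row: int
    col: int


def find(scene, color, shape):
    """Return the unique object with these attributes, or None."""
    matches = [o for o in scene if o.color == color and o.shape == shape]
    return matches[0] if len(matches) == 1 else None


def pure_answer(scene, a, b):
    """Spatial-relational rule: is object a strictly left of object b?"""
    oa, ob = find(scene, *a), find(scene, *b)
    return oa is not None and ob is not None and oa.col < ob.col


def spur_answer(scene, a, b):
    """Bag-of-Words shortcut: 'yes' whenever both mentioned objects exist."""
    return find(scene, *a) is not None and find(scene, *b) is not None


def swap_columns(scene, a, b):
    """Counterfactual scene: exchange the columns of the two queried objects."""
    oa, ob = find(scene, *a), find(scene, *b)
    swapped = []
    for o in scene:
        if o == oa:
            swapped.append(Obj(o.color, o.shape, o.row, ob.col))
        elif o == ob:
            swapped.append(Obj(o.color, o.shape, o.row, oa.col))
        else:
            swapped.append(o)
    return swapped


# Question (assumed): "Is the red circle left of the blue square?"
scene = [
    Obj("red", "circle", 0, 0),
    Obj("blue", "square", 1, 2),
    Obj("green", "triangle", 2, 1),
]
a, b = ("red", "circle"), ("blue", "square")
swapped = swap_columns(scene, a, b)

# Each entry holds (answer on original scene, answer on swapped scene).
results = {
    "pure": (pure_answer(scene, a, b), pure_answer(swapped, a, b)),
    "spur": (spur_answer(scene, a, b), spur_answer(swapped, a, b)),
}
print(results)
```

On the original scene both rules give the same answer, so accuracy alone cannot separate them; only the counterfactual swap does. An explanation method evaluated in this kind of setting would be expected to attribute the spatial rule's decision to object positions and the shortcut rule's decision to attribute presence alone.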