Seeing Is Not Sharing: Vision-Language Models Overestimate Common Ground in Asymmetric Dialogue
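Source: arxiv.org
A study of 13,077 annotated reference expressions from HCRC MapTask dialogues finds that vision-language models (VLMs) struggle to distinguish information that could be shared from information that has been shared between dialogue participants. Providing authentic map images improves overall performance but leads models to over-predict alignment; textual descriptions reproduce this bias, while non-informative images suppress alignment predictions entirely, indicating that the bias stems from task-relevant map content rather than from the visual channel. Calibration analysis and reference-chain tracking indicate that models rely on static referential cues on the map rather than tracking the grounding process through dialogue history. The phenomenon is most pronounced in Qwen3-VL-8B-Instruct, and four other models from two architecture families exhibit it to varying degrees. Map content, whether presented visually or textually, is treated by the models as evidence of mutual understanding, conflating potential with established common ground.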
In collaborative dialogue, shared perception does not guarantee shared interpretation. Mutual understanding must be established through interaction. We investigate whether vision-language models (VLMs) can distinguish what could be shared from what has been shared between dialogue participants through grounding. We formulate this as an interpretation-matching task on 13,077 annotated reference expressions from HCRC MapTask dialogues, and evaluate VLMs under systematically controlled manipulations of dialogue context and map-information access. Our results show that providing authentic map images improves overall performance but shifts models toward over-predicting alignment, so the improvement comes at the cost of degraded accuracy on non-aligned cases. Textual descriptions of the same map content reproduce this bias, while non-informative images suppress alignment predictions entirely, indicating that the bias is driven by task-relevant map content, not the visual channel. Calibration analysis and reference-chain tracking further suggest that models rely on static referential cues on the maps rather than tracking how grounding unfolds through dialogue history. We observe these patterns most clearly in Qwen3-VL-8B-Instruct and, to varying degrees, in four additional models from two architecture families. In models that exhibit the bias, map content, whether presented visually or textually, is treated as evidence of mutual understanding, conflating potential with established common ground.
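To make the notion of "over-predicting alignment" concrete, the following is a minimal sketch of how such a bias might be quantified. It assumes each reference expression carries a binary gold label (aligned or non-aligned) and a binary model prediction. The function name `alignment_bias`, the metric names, and the toy labels are hypothetical illustrations, not the paper's evaluation code or data.

```python
import math
from typing import Dict, Sequence


def alignment_bias(gold: Sequence[bool], pred: Sequence[bool]) -> Dict[str, float]:
    """Summarise how a binary aligned/non-aligned classifier errs.

    Hypothetical helper: True means "interpretations are aligned".
    Returns overall accuracy, accuracy on each gold class, and the gap
    between the predicted and the gold rate of "aligned" labels.
    """
    if len(gold) != len(pred) or len(gold) == 0:
        raise ValueError("gold and pred must be non-empty and of equal length")

    n = len(gold)
    pairs = list(zip(gold, pred))

    # Predictions split by gold class.
    on_aligned = [p for g, p in pairs if g]
    on_non_aligned = [p for g, p in pairs if not g]

    accuracy = sum(g == p for g, p in pairs) / n
    # Accuracy on aligned cases: fraction predicted aligned.
    aligned_acc = sum(on_aligned) / len(on_aligned) if on_aligned else math.nan
    # Accuracy on non-aligned cases: fraction predicted non-aligned.
    non_aligned_acc = (
        sum(not p for p in on_non_aligned) / len(on_non_aligned)
        if on_non_aligned
        else math.nan
    )
    # Positive values mean the model says "aligned" more often than the gold labels do.
    over_prediction = sum(pred) / n - sum(gold) / n

    return {
        "accuracy": accuracy,
        "aligned_accuracy": aligned_acc,
        "non_aligned_accuracy": non_aligned_acc,
        "over_prediction": over_prediction,
    }


if __name__ == "__main__":
    # Toy labels for illustration only (not taken from the study).
    gold = [True, True, False, False, False, True]
    pred = [True, True, True, True, False, True]
    for name, value in alignment_bias(gold, pred).items():
        print(f"{name}: {value:.3f}")
```

Under this framing, a model can score well overall while `over_prediction` is positive and `non_aligned_accuracy` is low, which is the pattern the abstract describes when map content is available.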