Seeing Is Not Knowing: Do Vision-Language Models (VLMs) Know When Not to Answer Spatial Questions (and Why)?

2026-05-28 08:00 · 36 days ago

AI Summary

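The study builds the SpatialUncertain evaluation framework and tests a range of frontier vision-language models (VLMs). Under two observation challenges, occlusion and perspective ambiguity, average model accuracy is around 30% and below 10% respectively, and some models perform near random chance when identifying which additional viewpoint would provide reliable evidence. The authors argue that evaluation should move beyond answer correctness toward whether models know when to abstain and how to seek reliable evidence.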

Original text · untranslated

Spatial reasoning is a fundamental capability for vision-language models (VLMs) deployed in real-world environments. However, visual observations are inherently limited representations of a 3D world: occlusion can render objects invisible, and perspective can make geometric properties misleading. Despite this, existing spatial reasoning benchmarks typically assume that observations are sufficient and reliable, focusing on whether models produce correct answers rather than whether they recognize when a question cannot be answered and what additional observations would be needed. In this work, we challenge this assumption by constructing a controlled evaluation framework, SpatialUncertain, and introducing two types of observation challenges: (1) occlusion, which hides target information, and (2) perspective ambiguity, which produces misleading visual cues. For each configuration, we design spatial questions that are answerable under clean observations but require abstention under the introduced challenges. We further evaluate whether models can identify which additional viewpoints would resolve perspective ambiguity. Our results across a diverse set of frontier open- and closed-source VLMs reveal two consistent failure modes. First, models are prone to overconfident answering, attempting to solve spatial reasoning tasks even when visual evidence is incomplete or misleading, with average accuracy around 30% under occlusion and below 10% under perspective ambiguity. Second, even when additional views are available, some models perform near random chance in identifying which would provide reliable evidence. Together, our findings call for moving beyond answer correctness toward evaluating whether models know when to abstain and how to seek reliable evidence.
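The abstract describes questions that are answerable under clean observations but call for abstention once occlusion or perspective ambiguity is introduced. A minimal sketch of how such responses might be scored is given below. It is not the paper's evaluation code: the `Item` record, the `ABSTAIN` label, the condition names, and both metric functions are hypothetical names chosen for illustration, and the sketch assumes model outputs have already been normalized to either a short answer or an abstention.

```python
from dataclasses import dataclass

# Hypothetical label meaning "cannot be determined from this observation".
ABSTAIN = "abstain"


@dataclass
class Item:
    condition: str    # assumed values: "clean", "occlusion", "perspective"
    gold_answer: str  # the answer under the clean observation
    prediction: str   # normalized model output: an answer or ABSTAIN


def expected_response(item: Item) -> str:
    """Clean items expect the gold answer; challenged items expect abstention."""
    return item.gold_answer if item.condition == "clean" else ABSTAIN


def accuracy_by_condition(items):
    """Fraction of items per condition whose prediction matches the expected response."""
    totals, correct = {}, {}
    for it in items:
        totals[it.condition] = totals.get(it.condition, 0) + 1
        if it.prediction == expected_response(it):
            correct[it.condition] = correct.get(it.condition, 0) + 1
    return {cond: correct.get(cond, 0) / n for cond, n in totals.items()}


def overconfidence_rate(items):
    """Share of challenged items where the model answered instead of abstaining."""
    challenged = [it for it in items if it.condition != "clean"]
    if not challenged:
        return 0.0
    return sum(it.prediction != ABSTAIN for it in challenged) / len(challenged)
```

Under this scoring, a model that always repeats its clean-view answer gets full credit on clean items and none on challenged ones, which is the overconfident-answering failure mode the abstract reports.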

HuggingFace Daily Papers (community trending papers)

Read the original · arxiv.org
