# SPACENUM: Revisiting Spatial Numerical Understanding in Vision-Language Models

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-22 08:00
- AIHOT score: 46
- AIHOT link: https://aihot.virxact.com/items/cmq5e15cb072jslt2m914h6mm
- Original link: https://arxiv.org/abs/2605.23898

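## AI Summary

SPACENUM is a unified framework that examines two settings: numbers as dynamic transitions during spatial exploration, and numbers as static layouts in spatial reasoning. Through two bidirectional tasks, Num2Space and Space2Num, it evaluates how well vision-language models (VLMs) map between visual spatial structure and linguistic numerical representations. Experiments show that current VLMs perform close to random guessing in both settings, rely heavily on shallow spatial cues, and fail to build stable coordinate-aware representations. Explicit reasoning yields only marginal gains, while fine-tuning partially improves spatial numerical understanding.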

## Body

Vision-Language Models (VLMs) are increasingly deployed in embodied environments, where they need to produce numerical outputs such as action magnitudes and spatial coordinates. Although these numbers appear meaningful, it remains unclear whether they are genuinely grounded in spatial perception. Therefore, in this work, we revisit spatial numerical understanding through SpaceNum, a unified framework that captures two complementary settings: numbers as dynamic transitions during spatial exploration, and numbers as static layouts in spatial reasoning. We formulate two bidirectional tasks, Num2Space and Space2Num, to evaluate how well VLMs map between vision-side spatial structure and language-side numerical representations. We systematically study whether current VLMs truly understand numerical values in spatial settings. Across dynamic transitions and static layouts, we find that models largely fail to ground numbers in spatial meaning and often perform close to random guessing. Through error analysis, reasoning trace analysis, and controlled interventions, we show that current VLMs rely heavily on shallow spatial cues, struggle to build stable coordinate-aware representations, and fail to abstract structured spatial layouts from visual observations. We further show that explicit reasoning provides only marginal gains, while tuning can partially improve spatial numerical understanding and transfer to external spatial reasoning benchmarks.
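
The abstract's claim that models "perform close to random guessing" can be made concrete with a small statistical check. The sketch below is illustrative only: it assumes a multiple-choice format with a fixed number of answer options, which the abstract does not specify, and the function names `chance_level` and `p_value_above_chance` are hypothetical rather than anything from the SpaceNum codebase. It computes a one-sided binomial p-value for whether an observed accuracy exceeds the chance rate.

```python
from math import comb


def chance_level(num_options: int) -> float:
    """Accuracy expected from uniform random guessing (assumed k-way multiple choice)."""
    if num_options < 2:
        raise ValueError("num_options must be at least 2")
    return 1.0 / num_options


def p_value_above_chance(correct: int, total: int, num_options: int) -> float:
    """One-sided binomial p-value: P(X >= correct) for X ~ Binomial(total, 1/num_options).

    A large p-value means the observed accuracy is statistically
    indistinguishable from random guessing.
    """
    if not 0 <= correct <= total:
        raise ValueError("correct must lie between 0 and total")
    p = chance_level(num_options)
    # Sum the binomial probability mass from `correct` up to `total`.
    return sum(
        comb(total, k) * p**k * (1.0 - p) ** (total - k)
        for k in range(correct, total + 1)
    )


if __name__ == "__main__":
    # Hypothetical numbers for illustration, not results from the paper.
    correct, total, num_options = 27, 100, 4
    print(f"accuracy = {correct / total:.2f}, chance = {chance_level(num_options):.2f}")
    print(f"p-value  = {p_value_above_chance(correct, total, num_options):.3f}")
```

Using an exact binomial tail rather than a normal approximation keeps the check valid for small evaluation sets, at the cost of summing up to `total + 1` terms.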