# Beyond 3D Visual Question Answering: Injecting 3D Spatial Priors into Vision-Language Models to Enhance Geometric Reasoning

- Source: HuggingFace Daily Papers (trending community papers)
- Published: 2026-05-28 08:00
- AIHOT score: 63
- AIHOT link: https://aihot.virxact.com/items/cmpqd5x5k03y4slnous1xa5wj
- Original link: https://arxiv.org/abs/2605.30231

## AI Summary

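Vision-language models often lack robust 3D spatial reasoning. Existing methods either fine-tune on 3D visual question-answering datasets, which leads to overfitting, or integrate specialized 3D encoders, which is cumbersome and inflexible. This work proposes the GASP framework, which injects fundamental geometric priors directly into the transformer layers of the large language model. The framework uses ground-truth geometry from large-scale video scenes and trains a small correspondence head with a dual objective: a contrastive loss enforces 2D view-invariance, and depth consistency supervision resolves 3D geometric ambiguities. The analysis shows that the internal correspondence matching accuracy of standard models is very low (often below 5%); after GASP training, the peak value of this metric exceeds 70%, with temporal robustness above 85%. This yields significant gains on downstream benchmarks, including +18.2% on All-Angles Bench and +29.0% on VSI-Bench, without training on any 3D VQA data.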

## Main Text

Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that rely on fine-tuning with 3D visual question-answering (VQA) datasets may overfit dataset-specific biases, while integrating specialized 3D visual encoders is often inflexible and cumbersome. In this paper, we argue that genuine spatial understanding should emerge from learning fundamental geometric priors, not only from high-level VQA supervision. We propose GASP (Geometric-Aware Spatial Priors), a framework that injects these priors directly into the LLM's transformer layers. GASP employs a small correspondence head, applied as a deep supervision signal across all layers, and is trained with a dual objective leveraging ground-truth geometry from large-scale video scenes: a contrastive loss on ground-truth point correspondences enforces 2D view-invariance, while a depth consistency supervision resolves 3D geometric ambiguities. Our analysis first provides a diagnostic showing that standard VLMs' internal correspondence matching accuracy is very low (often below 5%). We then demonstrate that our training substantially improves this behavior, boosting peak layer-wise correspondence to over 70% and maintaining over 85% temporal robustness while baselines remain below 5%. These internal improvements translate to significant gains on downstream spatial benchmarks including +18.2% on All-Angles Bench and +29.0% on VSI-Bench, all without training on any 3D VQA data. Our findings indicate that learning from fundamental geometric priors is a promising and generalizable pathway towards VLMs with more reliable 3D spatial reasoning.
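The abstract names a contrastive loss on ground-truth point correspondences, a depth consistency term, and a correspondence matching accuracy diagnostic, but it gives no formulas. The sketch below illustrates one plausible reading in NumPy and is not the authors' implementation. The InfoNCE form, the temperature of 0.07, the L1 depth penalty, and all function names are assumptions made for illustration. Row `i` of `feats_a` and row `i` of `feats_b` are assumed to be features of the same 3D point seen in two views.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def info_nce_correspondence_loss(feats_a, feats_b, temperature=0.07):
    """InfoNCE-style contrastive loss (assumed form, not from the paper).

    Each row of feats_a is pulled toward the matching row of feats_b
    and pushed away from every other row of feats_b.
    """
    a = l2_normalize(feats_a)
    b = l2_normalize(feats_b)
    logits = a @ b.T / temperature                   # (N, N) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct match for row i is column i, so average the diagonal.
    return float(-np.mean(np.diag(log_probs)))


def correspondence_accuracy(feats_a, feats_b):
    """Fraction of points whose nearest neighbour in the other view
    (by cosine similarity) is the ground-truth match."""
    a = l2_normalize(feats_a)
    b = l2_normalize(feats_b)
    nearest = np.argmax(a @ b.T, axis=1)
    return float(np.mean(nearest == np.arange(len(a))))


def depth_consistency_loss(pred_depth, gt_depth):
    """Mean absolute depth error (an assumed stand-in for the depth term)."""
    return float(np.mean(np.abs(pred_depth - gt_depth)))


if __name__ == "__main__":
    feats = np.eye(8)                      # 8 perfectly separable toy features
    shuffled = np.roll(feats, 1, axis=0)   # every correspondence is wrong
    print("aligned accuracy: ", correspondence_accuracy(feats, feats))
    print("shuffled accuracy:", correspondence_accuracy(feats, shuffled))
    print("aligned loss:     ", info_nce_correspondence_loss(feats, feats))
    print("shuffled loss:    ", info_nce_correspondence_loss(feats, shuffled))
```

Under this reading, the layer-wise diagnostic described in the abstract would amount to computing `correspondence_accuracy` on features taken from each transformer layer in turn, and the dual objective would be a weighted sum of the two losses. The weighting is not stated in the abstract.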
