# How and What to Imagine? Visual Thinking for Cross-View Spatial Reasoning in Unified Multimodal Models

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-26 08:00
- AIHOT score: 55
- AIHOT link: https://aihot.virxact.com/items/cmpptvdat012vslm6ix91dhkx
- Original link: https://arxiv.org/abs/2605.27310

## AI Summary

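Cross-view spatial reasoning is a weak spot for vision-language models, because they rely on language reasoning and lose geometric precision. Visual thinking addresses this by generating intermediate thinking images, but models often ignore this visual evidence. The study proposes View Dropout, a training strategy that hides parts of one input view from the answer span while keeping them visible to the thinking-image tokens, which pushes the model to use the thinking image when answering. It frames visual thinking as a learnability-informativeness tradeoff and tests three thinking-image variants. After training on synthetic scenes and evaluating on five real-world benchmarks, the results show that panoramic visual thinking combined with View Dropout is the only configuration that is both informative and learnable, and that it achieves the best cross-domain generalization.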

## Main Text

Cross-view spatial reasoning remains a weak spot for vision-language models (VLMs): they often reason in language and lose the fine-grained geometry needed for the task. Thinking with images aims to address this by generating an intermediate thinking image, but recent work shows that models often ignore the visual evidence in these traces. We therefore ask how to make visual thinking matter, and what kind of visual thinking works best. We study these questions in unified multimodal models (UMMs), which natively support interleaved image-text generation. For the first question, we propose View Dropout (VDrop), a training-time intervention that hides parts of one input view from the answer span while keeping them visible to the thinking-image tokens. This encourages the model to use the thinking image when answering, instead of relying only on the input views. Once the thinking image is used for answer prediction, we study which type of visual thinking is most effective. We frame this as a learnability-informativeness tradeoff and compare three thinking-image variants: top-down, panoramic, and point-matching renderings. Trained on synthetic scenes and evaluated on five real-world out-of-domain benchmarks, panoramic visual thinking with VDrop is the only configuration that is both informative and learnable, and it achieves the best out-of-domain generalization.
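
One way to picture the View Dropout idea described above is as an attention mask: answer tokens lose access to a random subset of one view's tokens, while thinking-image tokens still see them. The sketch below is only an illustration under stated assumptions, not the authors' implementation. The token layout (views, question, thinking image, answer), the purely causal base mask, the `drop_ratio` parameter, and the function name `view_dropout_mask` are all hypothetical; the abstract does not specify how VDrop is realized.

```python
import random


def view_dropout_mask(segments, drop_view, drop_ratio, seed=0):
    """Build a boolean attention mask allowed[query][key] with a VDrop-style block.

    segments: dict mapping a segment name to a half-open (start, end) token range.
              Must contain drop_view and "answer". (Hypothetical layout.)
    drop_view: name of the input view whose tokens are partly hidden.
    drop_ratio: fraction of that view's tokens hidden from the answer span.
    """
    seq_len = max(end for _, end in segments.values())

    # Assumption: start from a plain causal mask (each query sees itself and earlier keys).
    allowed = [[key <= query for key in range(seq_len)] for query in range(seq_len)]

    # Pick which tokens of the chosen view to hide.
    v_start, v_end = segments[drop_view]
    n_drop = round(drop_ratio * (v_end - v_start))
    dropped = sorted(random.Random(seed).sample(range(v_start, v_end), n_drop))

    # Hide the dropped view tokens from answer-span queries only.
    # Thinking-image queries are left untouched, so they can still attend to them.
    a_start, a_end = segments["answer"]
    for query in range(a_start, a_end):
        for key in dropped:
            allowed[query][key] = False

    return allowed, dropped


# Toy sequence: two 4-token views, a 2-token question,
# a 4-token thinking image, and a 2-token answer.
segments = {
    "view_a": (0, 4),
    "view_b": (4, 8),
    "question": (8, 10),
    "thinking": (10, 14),
    "answer": (14, 16),
}
mask, dropped = view_dropout_mask(segments, drop_view="view_b", drop_ratio=0.5)
```

In this toy setup, the only path from the hidden `view_b` tokens to the answer runs through the thinking-image positions, which is the pressure the abstract attributes to VDrop. A real unified multimodal model may use a different attention pattern (for example, bidirectional attention within image tokens), so the causal base mask here is a simplification.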