# Thinking with Visual Grounding

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-15 08:00
- AIHOT score: 48
- AIHOT link: https://aihot.virxact.com/items/cmqkbvrcp04ucslhimuoeg7o1
- Original link: https://arxiv.org/abs/2606.16122

## AI Summary

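The paper proposes visually grounded thinking: while generating natural-language reasoning steps, a VLM explicitly outputs points or boxes that ground the image regions each step relies on. The training pipeline extracts objects from correct reasoning traces, obtains grounding masks with a SAM3-based agent, and derives point and box supervision from those masks. The authors further propose grounding-aware reinforcement learning, which combines an answer-correctness reward with dense grounding rewards. On two counting benchmarks and four spatial reasoning benchmarks, applying the method to Gemma3-4B-IT improves performance, and on spatial reasoning tasks the resulting model matches or surpasses Gemma3-27B-IT. Point grounding suits counting, while box grounding benefits from explicit grounding rewards on spatial tasks.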
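
The step of deriving point and box supervision from a segmentation mask can be illustrated with a minimal sketch. The paper does not specify how points are chosen, so the rule used here (the foreground pixel nearest the mask centroid) is an assumption, as are the function names `mask_to_box` and `mask_to_point` and the list-of-rows mask format.

```python
from typing import List, Tuple

Mask = List[List[int]]  # rows of 0/1 values; mask[y][x] == 1 marks the object


def _foreground(mask: Mask) -> List[Tuple[int, int]]:
    """Collect (x, y) coordinates of all foreground pixels."""
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        raise ValueError("mask has no foreground pixels")
    return coords


def mask_to_box(mask: Mask) -> Tuple[int, int, int, int]:
    """Tight bounding box (x_min, y_min, x_max, y_max), inclusive pixel indices."""
    coords = _foreground(mask)
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)


def mask_to_point(mask: Mask) -> Tuple[int, int]:
    """Assumed rule: the foreground pixel closest to the mask centroid.

    Snapping to a foreground pixel keeps the point on the object even
    when the centroid of a concave shape falls outside the mask.
    """
    coords = _foreground(mask)
    cx = sum(x for x, _ in coords) / len(coords)
    cy = sum(y for _, y in coords) / len(coords)
    return min(coords, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)


# Example: an L-shaped object in a 3x3 image.
l_shape = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 1],
]
box = mask_to_box(l_shape)
point = mask_to_point(l_shape)
```

Because both targets come from the same mask, the point always lies inside the box, which is one way the two forms of supervision could stay aligned.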

## Main Text

Visual thinking should not only sound right; it should show its evidence. While recent vision-language models (VLMs) can produce natural-language reasoning traces, these traces often leave the supporting image regions implicit, making them hard to verify and difficult to supervise. We introduce visually grounded thinking, a reasoning process in which models interleave natural-language thoughts with explicit point or box groundings of the visual evidence used at each step. This lets the model express intermediate reasoning in language while grounding key objects in the image regions they refer to. To train this behavior, we construct a scalable synthesis pipeline that distills correct visual reasoning traces, extracts the visual objects required by the traces, grounds them with a SAM3-based agent, and derives aligned point and box supervision from the resulting masks. We further propose grounding-aware reinforcement learning, which combines answer correctness rewards with dense grounding rewards that score whether generated object references match the correct image evidence. Across two counting benchmarks and four spatial reasoning benchmarks, adding visually grounded thinking to Gemma3-4B-IT consistently improves performance over the original model and the non-grounded thinking baseline. On spatial reasoning, the visually grounded thinking 4B models match, and in some cases surpass, Gemma3-27B-IT from the same model family. Our analysis shows that point grounding is well suited to counting, while box grounding benefits most from explicit grounding rewards on spatial tasks. Overall, our results show that VLMs think better when their intermediate thoughts are tied to the image regions that make them true.
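
The idea of combining an answer-correctness reward with a dense grounding reward can be sketched as follows. The abstract does not give the reward formula, so everything here is an assumption for illustration: exact-match answer scoring, mean IoU over reference boxes matched by object name, the additive combination, and the hypothetical `grounding_weight` parameter.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), corner coordinates


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(pred: Dict[str, Box], ref: Dict[str, Box]) -> float:
    """Mean IoU over reference objects; an object with no prediction scores 0."""
    if not ref:
        return 0.0
    scores = [iou(pred[name], box) if name in pred else 0.0 for name, box in ref.items()]
    return sum(scores) / len(scores)


def total_reward(
    pred_answer: str,
    ref_answer: str,
    pred_boxes: Dict[str, Box],
    ref_boxes: Dict[str, Box],
    grounding_weight: float = 0.5,  # hypothetical weight, not from the paper
) -> float:
    """Answer correctness (0 or 1) plus a weighted dense grounding term."""
    correct = pred_answer.strip().lower() == ref_answer.strip().lower()
    answer_reward = 1.0 if correct else 0.0
    return answer_reward + grounding_weight * grounding_reward(pred_boxes, ref_boxes)


# Example: correct answer with a box shifted halfway off the reference.
ref_boxes = {"cup": (0.0, 0.0, 10.0, 10.0)}
shifted = {"cup": (5.0, 0.0, 15.0, 10.0)}
reward = total_reward("3", "3", shifted, ref_boxes)
```

The grounding term is dense in the sense that it gives partial credit per object even when the final answer is wrong, whereas the answer term alone is all-or-nothing.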
