# SR-REAL: Dual-Path Reasoning Enhancement for Spatial Vision-Language Models

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-16 08:00
- AIHOT score: 40
- AIHOT link: https://aihot.virxact.com/items/cmqj0n3n60626sl5wzw3swo0f
- Original link: https://arxiv.org/abs/2606.17539

## AI Summary

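SR-REAL equips a spatial VLM with two complementary reasoning paths: Language-Only Reasoning (LOR) and Detect-Then-Reason (DTR). LOR performs step-by-step linguistic deduction, while DTR first detects 3D geometric cues (centers or bounding boxes) via region tokens and then reasons geometrically over them. The framework starts with cold-start supervised fine-tuning that builds chain-of-thought supervision for both paths, then applies reinforcement learning with accuracy and format rewards; DTR additionally receives a discrete center-based detection reward. Across multiple spatial benchmarks, SR-REAL significantly outperforms baselines: a single RL-trained model supports both paths, joint training makes the paths reinforce each other, and the model generalizes across datasets and domains without per-task tuning.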

## Abstract

Spatial VLMs have made substantial progress in geometric perception, yet complex spatial reasoning requiring multi-step inference over depth, distance, and scene relations remains challenging. Moreover, different spatial queries call for fundamentally different strategies: some are best addressed through purely linguistic, step-by-step deduction, while others require explicit 3D grounding before quantitative inference. We present Dual-Path Spatial Reasoning via Reinforcement Learning for Spatial VLMs (SR-REAL), a unified framework that equips a spatial VLM with two complementary reasoning paths: Language-Only Reasoning (LOR), which performs step-by-step linguistic deduction, and Detect-Then-Reason (DTR), which detects 3D geometric cues (e.g., centers or bounding boxes) via region tokens before explicit geometric inference. SR-REAL begins with a cold-start supervised fine-tuning stage that constructs LOR and DTR chain-of-thought supervision and exposes a region-to-3D interface, followed by RL that optimizes the policy model with accuracy and format rewards; for DTR, a discrete center-based detection reward further refines geometric alignment. Across diverse spatial benchmarks, SR-REAL significantly outperforms spatial VLM baselines: (i) a single RL-trained model supports both reasoning paths, with DTR excelling in region-aware tasks through precise 3D localization and LOR enhancing general spatial reasoning; (ii) jointly training both paths fosters mutual reinforcement; (iii) high-quality, blended cold-start data is crucial for stable RL optimization; and (iv) the model generalizes across datasets and domains without per-task tuning, demonstrating positive transfer between LOR and DTR.
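
The reward structure described above (accuracy and format rewards for both paths, plus a discrete center-based detection reward for DTR) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the `<think>`/`<answer>` output template, exact-match answer scoring, the 0.5 bin size, per-axis bin matching, and the equal weights. The function names are hypothetical as well.

```python
import math
import re
from typing import Optional, Sequence, Tuple

# ASSUMPTION: the abstract does not specify the output format; this
# think/answer template is a hypothetical stand-in.
_TEMPLATE = re.compile(r"<think>(.+?)</think>\s*<answer>(.+?)</answer>", re.DOTALL)


def format_reward(response: str) -> float:
    """Return 1.0 if the response follows the assumed template, else 0.0."""
    return 1.0 if _TEMPLATE.fullmatch(response.strip()) else 0.0


def accuracy_reward(response: str, target: str) -> float:
    """Return 1.0 if the extracted answer equals the target (case-insensitive)."""
    match = _TEMPLATE.fullmatch(response.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(2).strip().lower() == target.strip().lower() else 0.0


def discretize(center: Sequence[float], bin_size: float) -> Tuple[int, ...]:
    """Map a continuous 3D center to integer bin indices along each axis."""
    return tuple(math.floor(c / bin_size) for c in center)


def detection_reward(
    pred_center: Optional[Sequence[float]],
    gt_center: Sequence[float],
    bin_size: float = 0.5,  # ASSUMPTION: illustrative bin size
) -> float:
    """Fraction of axes on which predicted and ground-truth centers share a bin."""
    if pred_center is None or len(pred_center) != len(gt_center):
        return 0.0
    pred_bins = discretize(pred_center, bin_size)
    gt_bins = discretize(gt_center, bin_size)
    hits = sum(p == g for p, g in zip(pred_bins, gt_bins))
    return hits / len(gt_bins)


def total_reward(
    response: str,
    target: str,
    path: str,  # "LOR" or "DTR"
    pred_center: Optional[Sequence[float]] = None,
    gt_center: Optional[Sequence[float]] = None,
) -> float:
    """Sum accuracy and format rewards; add the detection term only for DTR."""
    reward = accuracy_reward(response, target) + format_reward(response)
    if path == "DTR" and gt_center is not None:
        reward += detection_reward(pred_center, gt_center)
    return reward
```

In this sketch an LOR rollout is scored on answer correctness and format only, while a DTR rollout additionally earns partial credit for each axis where its predicted center lands in the same bin as the ground truth. A real implementation would need the paper's actual output format, discretization scheme, and reward weighting.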
