# SpatialAct: Probing the Spatial Reasoning-to-Action Ability of VLM Agents in 3D Scenes

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-29 08:00
- AIHOT score: 56
- AIHOT link: https://aihot.virxact.com/items/cmpzffifb030aslkp8zv2r33z
- Original link: https://arxiv.org/abs/2605.31148

## AI Summary

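SpatialAct is a simulator-based benchmark for evaluating action-conditioned spatial reasoning by vision-language model (VLM) agents in 3D scenes. Starting from a multi-turn interactive refinement task, the benchmark adds a single-step error detection and fix task and five fundamental spatial ability tasks. Experiments show that current VLMs perform well on isolated spatial reasoning tasks but struggle to maintain consistent spatial beliefs and produce reliable actions during multi-turn feedback, performing substantially below humans. The results suggest that, even when low-level control is abstracted away, current VLM agents still lack robust spatial state tracking under action-induced environment changes.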

## Main Text

Humans can effortlessly perceive spatial layouts, form cognitive representations, reason about spatial relations, and translate such reasoning into actions in everyday 3D environments. Although recent vision-language models (VLMs) have shown promising performance on observation-conditioned spatial perception and reasoning tasks, it remains unclear whether they can build coherent spatial understanding, act upon it, and refine their actions through multi-turn feedback. To study this problem, we introduce SpatialAct, a simulator-grounded benchmark for probing action-conditioned spatial reasoning in 3D scenes. Starting from the most challenging setting, Multi-turn Interactive Refinement, we further design its decomposed counterpart, Single-step Error Detection and Fix, together with five fundamental spatial ability tasks to diagnose the underlying causes of model failures. Experiments reveal a clear reasoning-to-action gap: current VLMs can perform well on isolated spatial reasoning tasks, but struggle to maintain coherent spatial beliefs and produce reliable actions during multi-turn feedback, substantially underperforming humans. These results suggest that current VLM agents still lack robust spatial state tracking under action-induced environment changes, even when low-level control is abstracted away.
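The abstract's point about spatial state tracking under action-induced changes can be illustrated with a toy loop. The sketch below is not the SpatialAct protocol or its API: `ToyScene`, `tracking_policy`, `make_stale_policy`, and `run_episode` are hypothetical names, and a 2D integer grid stands in for a simulated 3D scene. It only contrasts a policy that re-reads feedback after every action with one that keeps acting on its first observation.

```python
from dataclasses import dataclass


@dataclass
class ToyScene:
    """Hypothetical 2D stand-in for a simulated scene: one object, one target."""
    obj: tuple[int, int]
    target: tuple[int, int]

    def step(self, move: tuple[int, int]) -> str:
        """Apply a relative move to the object and return textual feedback."""
        self.obj = (self.obj[0] + move[0], self.obj[1] + move[1])
        dx = self.target[0] - self.obj[0]
        dy = self.target[1] - self.obj[1]
        if dx == 0 and dy == 0:
            return "done"
        return f"offset {dx} {dy}"


def sign(v: int) -> int:
    return (v > 0) - (v < 0)


def tracking_policy(feedback: str) -> tuple[int, int]:
    """Re-derive the move from the latest feedback (one unit per axis)."""
    _, dx, dy = feedback.split()
    return (sign(int(dx)), sign(int(dy)))


def make_stale_policy():
    """Build a policy that repeats its first move and ignores later feedback,
    i.e. it never updates its belief after its own actions change the scene."""
    cached = []

    def policy(feedback: str) -> tuple[int, int]:
        if not cached:
            cached.append(tracking_policy(feedback))
        return cached[0]

    return policy


def run_episode(scene: ToyScene, policy, max_turns: int = 10):
    """Return the number of actions used to reach the target, or None on failure."""
    feedback = scene.step((0, 0))  # initial observation, no movement
    for turn in range(max_turns):
        if feedback == "done":
            return turn
        feedback = scene.step(policy(feedback))
    return max_turns if feedback == "done" else None


if __name__ == "__main__":
    print(run_episode(ToyScene((0, 0), (3, 1)), tracking_policy))
    print(run_episode(ToyScene((0, 0), (3, 1)), make_stale_policy()))
```

In this toy setting the tracking policy reaches the target, while the stale policy overshoots on one axis and never recovers within the turn budget. The real benchmark involves perception from images and far richer actions, so this is only an analogy for the failure mode described.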