# Stream3D-VLM: An Online 3D Spatial Understanding Model with Incremental Geometry Priors

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-05 08:00
- AIHOT score: 64
- AIHOT link: https://aihot.virxact.com/items/cmq4qg4m600rwslt2fqzikiba
- Original link: https://arxiv.org/abs/2606.06891

## AI Summary

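Stream3D-VLM is an online 3D vision-language model that performs real-time spatial understanding from streaming video. It uses autoregressive streaming-control modeling, built on the LLM's next-token prediction objective, to decide when to respond. A lightweight Visual-Spatial Feature Integration (VSFI) module incrementally injects temporally aligned geometry priors, and a Geometry-Adaptive Voxel Compression (GAVC) module compresses visual tokens efficiently. To mitigate the scarcity of streaming 3D-language data, the authors build a data generation pipeline that yields over 1M online spatio-temporal 3D QA pairs and establish a benchmark covering 29 tasks. Experiments show that the model significantly outperforms both proprietary and open-source models on online and offline 3D spatial understanding, reasoning, and grounding tasks.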

## Main Text

Despite advances in 3D scene understanding, existing 3D Large Multimodal Models operate in offline settings, requiring complete scene observations or predefined video clips. In this paper, we present an online 3D vision-language model that enables real-time spatial understanding from streaming video. Our approach adopts autoregressive streaming-control modeling based on the LLM's next-token prediction objective to learn when to respond, and employs a lightweight Visual-Spatial Feature Integration (VSFI) module to incrementally inject temporally aligned geometry priors into the visual stream. To alleviate long-context decoding overhead, we propose a plug-and-play Geometry-Adaptive Voxel Compression (GAVC) module for efficient visual token compression. To address the scarcity of streaming 3D-language data, we further develop a scalable data generation pipeline that curates over 1M online spatio-temporal 3D QA pairs and establishes a comprehensive benchmark spanning 29 tasks. Extensive experiments show that our approach significantly outperforms both proprietary and open-source models across online and offline 3D spatial understanding, reasoning, and grounding tasks. The project page is available at https://stream3d-vlm.github.io/
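
The abstract does not describe how GAVC works internally, so the following is only a rough illustration of the general idea behind voxel-based token compression, not the authors' method. It assumes each visual token comes with an estimated 3D position, groups tokens that fall into the same cell of a uniform grid, and averages their features. The function name `voxel_pool`, the fixed `voxel_size`, and mean pooling are all hypothetical choices made for this sketch; a geometry-adaptive scheme would presumably vary the cell size or the pooling rule with scene structure.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]
VoxelKey = Tuple[int, int, int]


def voxel_pool(
    points: Sequence[Point],
    features: Sequence[Sequence[float]],
    voxel_size: float,
) -> Tuple[List[VoxelKey], List[List[float]]]:
    """Merge tokens that share a voxel by averaging their features.

    Hypothetical helper for illustration only: uniform grid, mean pooling.
    Returns the occupied voxel indices (sorted) and one pooled feature
    vector per occupied voxel.
    """
    if voxel_size <= 0:
        raise ValueError("voxel_size must be positive")
    if len(points) != len(features):
        raise ValueError("points and features must have the same length")

    # Bucket each token's feature vector by the integer voxel it falls into.
    buckets: Dict[VoxelKey, List[Sequence[float]]] = defaultdict(list)
    for (x, y, z), feat in zip(points, features):
        key = (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        buckets[key].append(feat)

    # Average the features inside each occupied voxel.
    keys = sorted(buckets)
    pooled: List[List[float]] = []
    for key in keys:
        group = buckets[key]
        dim = len(group[0])
        pooled.append([sum(f[d] for f in group) / len(group) for d in range(dim)])
    return keys, pooled


if __name__ == "__main__":
    # Four tokens; the first two fall into the same 0.5-unit voxel.
    pts = [(0.1, 0.1, 0.1), (0.2, 0.3, 0.4), (1.2, 0.1, 0.1), (-0.1, 0.0, 0.0)]
    feats = [[1.0, 0.0], [3.0, 2.0], [5.0, 5.0], [7.0, 1.0]]
    keys, pooled = voxel_pool(pts, feats, voxel_size=0.5)
    print(len(pts), "tokens ->", len(pooled), "tokens")
    print(keys)
    print(pooled)
```

In this toy input, four tokens collapse to three because two of them occupy the same cell. The compression ratio in such a scheme depends on how densely the tokens overlap in space, which is why repeated views of the same region in a video stream are a natural target for this kind of merging.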
