# Geometric Context Transformer for Streaming 3D Reconstruction

- Source: HuggingFace Daily Papers (trending community papers)
- Published: 2026-04-15 08:00
- AIHOT link: https://aihot.virxact.com/items/cmo1jd5hu00spslrr470vhxtk
- Original paper: https://arxiv.org/abs/2604.14141

## AI Summary

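The research team presents LingBot-Map, a foundation model for streaming 3D reconstruction built on a geometric context transformer (GCT). Its novel attention mechanism combines an anchor context, a pose-reference window, and a trajectory memory, which handle coordinate grounding, dense geometric cue extraction, and long-range drift correction, respectively. The system runs at about 20 FPS on 518×378 inputs, stably processes sequences of more than 10,000 frames, and keeps its streaming state compact. Across multiple benchmarks it outperforms existing streaming and iterative optimization-based methods.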

## Main Text

Streaming 3D reconstruction aims to recover 3D information, such as camera poses and point clouds, from a video stream, which demands geometric accuracy, temporal consistency, and computational efficiency. Motivated by the principles of Simultaneous Localization and Mapping (SLAM), we introduce LingBot-Map, a feed-forward 3D foundation model for reconstructing scenes from streaming data, built upon a geometric context transformer (GCT) architecture. The defining feature of LingBot-Map is its carefully designed attention mechanism, which integrates an anchor context, a pose-reference window, and a trajectory memory to address coordinate grounding, dense geometric cues, and long-range drift correction, respectively. This design keeps the streaming state compact while retaining rich geometric context, enabling stable, efficient inference at around 20 FPS on 518 × 378 inputs over sequences exceeding 10,000 frames. Extensive evaluations across a variety of benchmarks show that our approach outperforms both existing streaming methods and iterative optimization-based methods.
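The abstract names the three parts of the attention context but not their sizes or how older frames are summarized. The sketch below is a minimal, hypothetical cache model, useful only for reasoning about why such a state stays compact. It assumes one anchor frame kept in full, a pose-reference window of the 16 most recent frames kept in full, and one compact token per older frame in the trajectory memory. It also assumes a 14-pixel patch size, because 518 and 378 are both multiples of 14 (37 × 27 = 999 patches per frame); the abstract does not confirm this. `StreamingContext`, `push`, `state_tokens`, and every default value are illustrative names, not LingBot-Map's API.

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical: 14-pixel patches on a 518 x 378 frame -> 37 x 27 dense tokens.
PATCH = 14
TOKENS_PER_FRAME = (518 // PATCH) * (378 // PATCH)  # 999


@dataclass
class StreamingContext:
    """Toy cache: anchor context + pose-reference window + trajectory memory."""
    num_anchor: int = 1      # frames kept in full to fix the world coordinate frame
    window: int = 16         # most recent frames kept with full dense tokens
    mem_per_frame: int = 1   # compact tokens kept for each frame leaving the window
    anchors: list = field(default_factory=list)
    recent: deque = field(default_factory=deque)
    memory: list = field(default_factory=list)

    def push(self, frame_id: int) -> None:
        """Register a new frame after it has been processed."""
        if len(self.anchors) < self.num_anchor:
            self.anchors.append(frame_id)
            return
        self.recent.append(frame_id)
        if len(self.recent) > self.window:
            # Oldest window frame: its dense tokens are dropped and only
            # mem_per_frame summary tokens are counted from now on.
            self.memory.append(self.recent.popleft())

    def state_tokens(self) -> int:
        """Number of key/value tokens a new frame would attend to."""
        dense = (len(self.anchors) + len(self.recent)) * TOKENS_PER_FRAME
        return dense + len(self.memory) * self.mem_per_frame


if __name__ == "__main__":
    ctx = StreamingContext()
    for t in range(10_000):
        ctx.push(t)
    print(ctx.state_tokens(), 10_000 * TOKENS_PER_FRAME)  # 26966 9990000
    print(10_000 / 20, "s at ~20 FPS")                    # 500.0 s at ~20 FPS
```

Under these assumptions, after 10,000 frames the attention context holds 26,966 tokens instead of the 9,990,000 a full per-frame cache would need, and processing 10,000 frames at ~20 FPS takes about 500 s. In this sketch the trajectory memory still grows linearly, but at one token per frame rather than 999.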