# An Instance-Structured 3D Tokenization Framework for Unposed Multi-View Images

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-28 08:00
- AIHOT score: 56
- AIHOT link: https://aihot.virxact.com/items/cmr1vhkjz02exsl8zoi0uzcca
- Original link: https://arxiv.org/abs/2606.29513

## AI Summary

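A feed-forward 3D scene reconstruction framework that decomposes a scene into instance-structured 3D token groups directly from unposed multi-view images. Each group contains one instance token that captures entity-level identity and multiple anchor tokens that encode local geometry and appearance; the group is decoded into a set of 3D Gaussians. The groups are learned through differentiable rendering with joint reconstruction and segmentation supervision, requiring no 3D annotations. The model surpasses per-scene optimization baselines in class-agnostic instance segmentation and is competitive in novel view synthesis. The token groups directly enable instance-level scene editing (removing, translating, or inserting objects) and efficient open-vocabulary 3D instance retrieval, where retrieval complexity scales with the number of instances rather than the number of primitives.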

## Abstract

A 3D scene is understood through its objects, not the primitives that compose them. Yet feed-forward reconstruction methods output dense, unstructured sets of points or Gaussians, leaving object-level structure to be recovered after the fact. We propose a feed-forward framework that decomposes a scene into instance-structured 3D token groups directly from unposed multi-view images -- compact object-centric units from which reconstruction, segmentation, and manipulation all follow. Each token group pairs an instance token capturing entity-level identity with anchor tokens that encode local geometry and appearance, which are decoded into a set of 3D Gaussians. This two-level factorization decouples object identity from local appearance, making object instances a native interface of the representation rather than a derived product. The token groups are learned through differentiable rendering with joint reconstruction and segmentation supervision, requiring no 3D annotations. Our feed-forward model surpasses per-scene optimization baselines in class-agnostic instance segmentation while remaining competitive in novel view synthesis. Beyond these metrics, the same token groups directly unlock instance-level scene editing -- removing, translating, or inserting objects by operating on their groups -- as well as efficient open-vocabulary 3D instance retrieval, where retrieval complexity scales with the number of instances rather than primitives.
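As a rough illustration of why a group-level representation makes editing and retrieval simple, the following Python sketch models a scene as a list of token groups. It is not the paper's implementation: `TokenGroup`, `Gaussian`, the edit helpers, and `retrieve` are hypothetical names, each Gaussian is reduced to its center, and the instance token is assumed to be a plain embedding vector compared with a query embedding by cosine similarity (the abstract does not specify the retrieval mechanism).

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import List, Tuple

Vec = Tuple[float, ...]


@dataclass
class Gaussian:
    # Hypothetical minimal primitive: only the center is modeled here.
    mean: Tuple[float, float, float]


@dataclass
class TokenGroup:
    # Stand-in for the instance token (entity-level identity).
    instance_embedding: Vec
    # Stand-in for the Gaussians decoded from the group's anchor tokens.
    gaussians: List[Gaussian] = field(default_factory=list)


def remove_instance(scene: List[TokenGroup], idx: int) -> List[TokenGroup]:
    """Return a new scene without the group at position idx."""
    return [g for i, g in enumerate(scene) if i != idx]


def translate_instance(scene: List[TokenGroup], idx: int,
                       offset: Tuple[float, float, float]) -> List[TokenGroup]:
    """Return a new scene in which one group's Gaussians are shifted by offset."""
    old = scene[idx]
    moved = TokenGroup(
        old.instance_embedding,
        [Gaussian(tuple(m + o for m, o in zip(g.mean, offset)))
         for g in old.gaussians],
    )
    return scene[:idx] + [moved] + scene[idx + 1:]


def insert_instance(scene: List[TokenGroup], group: TokenGroup) -> List[TokenGroup]:
    """Return a new scene with one more group appended."""
    return scene + [group]


def cosine(a: Vec, b: Vec) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(scene: List[TokenGroup], query: Vec) -> int:
    """Index of the best-matching group: one comparison per instance,
    independent of how many Gaussians each group holds."""
    return max(range(len(scene)),
               key=lambda i: cosine(scene[i].instance_embedding, query))


# Toy scene with two instances; the embeddings are made up for illustration.
scene = [
    TokenGroup((1.0, 0.0), [Gaussian((0.0, 0.0, 0.0)), Gaussian((0.1, 0.0, 0.0))]),
    TokenGroup((0.0, 1.0), [Gaussian((2.0, 0.0, 0.0))]),
]
best = retrieve(scene, (0.9, 0.1))
shifted = translate_instance(scene, 1, (1.0, 0.0, 0.0))
smaller = remove_instance(scene, 0)
```

Because `retrieve` computes one similarity per group, its cost grows with the number of instances, which is the scaling property the abstract describes. In practice the query vector would come from some text encoder, which is omitted here.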
