# World Tracing: A Generative Pixel-Aligned Geometry Representation Beyond the Visible Surface

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-11 08:00
- AIHOT score: 47
- AIHOT link: https://aihot.virxact.com/items/cmqffbx8r00ypsl2alv2d9sp8
- Original link: https://arxiv.org/abs/2606.13652

## AI Summary

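World Tracing is a generative pixel-aligned geometry representation that predicts, for each input pixel, an ordered stack of camera-space 3D points: the first layer corresponds to the visible surface, and subsequent layers represent front-to-back intersections with occluded surfaces. The representation is instantiated by a world-tracing diffusion transformer (WT-DiT), which treats the multiple geometry layers as separate denoising tokens coupled through factorized and global attention. It is trained with pixel-space flow matching and a mixed noise schedule that balances visible-surface reconstruction against occluded-geometry generation. On object, scene, and dynamic benchmarks, World Tracing outperforms both depth predictors and image-to-3D generators in visible-surface reconstruction and complete geometry generation. It also preserves 2D-to-3D correspondence, supporting text-driven 3D scene editing, geometry-conditioned novel-view video synthesis, and seamless integration with textured-mesh generators.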

## Main Text

Image-to-3D methods often trade off faithfulness and completeness: depth estimators are anchored to input pixels but stop at the visible surface, while image-to-3D models generate complete shapes that are often misaligned with the input. We introduce World Tracing, a generative pixel-aligned geometry representation that predicts 3D points aligned with observed pixels while completing geometry beyond the visible surface. For each input pixel, World Tracing predicts an ordered stack of camera-space 3D points, where the first layer represents the visible surface and subsequent layers represent front-to-back intersections with occluded surfaces. We instantiate this representation with a world-tracing diffusion transformer, WT-DiT, which treats multiple geometry layers as separate denoising tokens coupled through factorized and global attention. WT-DiT is trained with pixel-space flow matching and a mixed noise schedule that balances visible-surface reconstruction with occluded-geometry generation. World Tracing achieves strong performance on visible-surface reconstruction and complete geometry generation across object, scene, and dynamic benchmarks, outperforming both depth predictors and image-to-3D generators. It also preserves 2D-to-3D correspondence, enabling text-driven 3D scene editing, geometry-conditioned novel-view video synthesis, and training-free integration with textured-mesh generators.
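
The per-pixel point stack described above can be made concrete with a small geometric sketch. This is not WT-DiT and not the authors' code: it builds the kind of target the representation describes by tracing each pixel's ray through a toy scene of spheres and collecting the front-to-back surface intersections. The pinhole intrinsics, the half-pixel centre convention, the sphere scene, the `None` padding for missing layers, and all function names are assumptions made for illustration only.

```python
import math

def pixel_ray(u, v, fx, fy, cx, cy):
    """Unit direction of the camera ray through the centre of pixel (u, v).
    Assumes a pinhole camera at the origin looking down +z."""
    d = ((u + 0.5 - cx) / fx, (v + 0.5 - cy) / fy, 1.0)
    norm = math.sqrt(sum(c * c for c in d))
    return tuple(c / norm for c in d)

def sphere_hits(direction, center, radius):
    """Positive distances along a ray from the origin where it crosses
    the sphere surface, ordered near to far (empty list on a miss)."""
    b = sum(d * c for d, c in zip(direction, center))
    c = sum(x * x for x in center) - radius * radius
    disc = b * b - c
    if disc < 0:
        return []
    root = math.sqrt(disc)
    return [t for t in (b - root, b + root) if t > 0]

def trace_point_stack(u, v, intrinsics, spheres, num_layers):
    """Ordered stack of camera-space 3D points for one pixel.
    Layer 0 is the visible surface; later layers are occluded surface
    crossings, front to back. Missing layers are padded with None
    (an illustrative choice, not taken from the paper)."""
    d = pixel_ray(u, v, *intrinsics)
    ts = sorted(t for center, radius in spheres
                for t in sphere_hits(d, center, radius))
    points = [tuple(t * c for c in d) for t in ts[:num_layers]]
    return points + [None] * (num_layers - len(points))

def project(point, fx, fy, cx, cy):
    """Project a camera-space point back to continuous pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx - 0.5, fy * y / z + cy - 0.5)

# Hypothetical 64x64 camera and a single unit sphere three units away.
intrinsics = (100.0, 100.0, 32.0, 32.0)
spheres = [((0.0, 0.0, 3.0), 1.0)]
stack = trace_point_stack(32, 32, intrinsics, spheres, num_layers=3)
```

In this toy scene the central pixel receives two valid layers, the front and back of the sphere, and every valid point projects back onto the pixel it was traced from, which is the 2D-to-3D correspondence the abstract refers to. A depth estimator would correspond to keeping only layer 0.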
