NeuWorld:通过神经隐式场景实现交互式世界探索
阅读原文· arxiv.orgNeuWorld提出场景中心范式Walking in the Implicit,将交互式视频生成的滚动变量从帧级潜变量替换为固定长度的可渲染隐式状态NIS。模型利用Transformer VAE从稀疏有姿态帧学习局部锚定的NIS,并通过扩散Transformer根据未来相机轨迹和几何感知历史演化NIS。通过复用VAE编码器作为统一条件器,将相机、参考图像和历史线索映射到同一NIS模态,避免外部异构编码器。模型在公开姿态视图数据上从头训练,未使用预训练视频骨干或3D重建器,实现了强长程一致性和有利推理效率。
Interactive video generation systems for camera-controlled world exploration roll out growing sequences of latent video frames, entangling state transition with high-frequency observation synthesis. We propose Walking in the Implicit, a scene-centric paradigm that changes the rollout variable from frame latents to a fixed-length, renderable implicit state, termed Neural Implicit Scene (NIS). This factorizes interactive generation into stochastic transition of a compact scene state and deterministic pose-conditioned rendering given the sampled state. We instantiate this paradigm as NeuWorld: a transformer VAE learns locally anchored NIS from sparse posed frames, and a diffusion transformer evolves NIS conditioned on future camera trajectories and geometry-aware retrieved history. By reusing the VAE encoder as a unified conditioner, NeuWorld maps camera, reference-image, and history cues into the same NIS modality, avoiding external heterogeneous encoders. Trained from scratch on public posed-view data without pretrained video backbones or auxiliary 3D reconstructors, NeuWorld achieves strong long-horizon consistency with favorable inference efficiency.