# Lyra 2.0: Explorable Generative 3D Worlds

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-04-14 08:00
- AIHOT link: https://aihot.virxact.com/items/cmnzj0j7u03orsl0f023stx8n
- Original link: https://arxiv.org/abs/2604.13036

## AI Summary

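Lyra 2.0 is a framework for generating large-scale, explorable 3D worlds by generating camera-controlled videos and lifting them to 3D with feed-forward reconstruction. To counter spatial forgetting and temporal drifting in long-trajectory generation, the framework maintains per-frame 3D geometry and uses it for information routing, retrieving past frames and establishing correspondences. It also trains with self-augmented histories so that the model learns to correct errors rather than accumulate drift. These methods substantially extend 3D-consistent video trajectories, which can in turn be used to fine-tune reconstruction models that reliably recover high-quality 3D scenes.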

## Main Text

Recent advances in video generation enable a new paradigm for 3D scene creation: generating camera-controlled videos that simulate scene walkthroughs, then lifting them to 3D via feed-forward reconstruction techniques. This generative reconstruction approach combines the visual fidelity and creative capacity of video models with 3D outputs ready for real-time rendering and simulation. Scaling to large, complex environments requires 3D-consistent video generation over long camera trajectories with large viewpoint changes and location revisits, a setting where current video models degrade quickly. Existing methods for long-horizon generation are fundamentally limited by two forms of degradation: spatial forgetting and temporal drifting. As exploration proceeds, previously observed regions fall outside the model's temporal context, forcing the model to hallucinate structures when revisited. Meanwhile, autoregressive generation accumulates small synthesis errors over time, gradually distorting scene appearance and geometry. We present Lyra 2.0, a framework for generating persistent, explorable 3D worlds at scale. To address spatial forgetting, we maintain per-frame 3D geometry and use it solely for information routing -- retrieving relevant past frames and establishing dense correspondences with the target viewpoints -- while relying on the generative prior for appearance synthesis. To address temporal drifting, we train with self-augmented histories that expose the model to its own degraded outputs, teaching it to correct drift rather than propagate it. Together, these enable substantially longer and 3D-consistent video trajectories, which we leverage to fine-tune feed-forward reconstruction models that reliably recover high-quality 3D scenes.
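
The information-routing idea above (using stored per-frame geometry only to decide which past frames are relevant to a target viewpoint) can be illustrated with a small sketch. This is not the paper's implementation: the function names `unproject`, `visible_fraction`, and `retrieve_frames`, the pinhole camera model, and the coverage score (the fraction of a past frame's points that reproject inside the target image) are all assumptions chosen for illustration. Occlusion is ignored.

```python
import numpy as np


def unproject(depth, K, cam_to_world):
    """Lift a depth map (H, W) to world-space points of shape (H*W, 3).

    Assumes a pinhole camera with intrinsics K and a 4x4 cam_to_world pose.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    # Homogeneous pixel centers, shape (3, N).
    pix = np.stack([u.ravel() + 0.5, v.ravel() + 0.5, np.ones(h * w)], axis=0)
    rays = np.linalg.inv(K) @ pix          # camera-space rays with z = 1
    pts_cam = rays * depth.ravel()         # scale each ray by its depth
    pts_world = cam_to_world[:3, :3] @ pts_cam + cam_to_world[:3, 3:4]
    return pts_world.T


def visible_fraction(points_world, K, world_to_cam, h, w):
    """Fraction of points that land inside the target image (no occlusion test)."""
    pts_cam = world_to_cam[:3, :3] @ points_world.T + world_to_cam[:3, 3:4]
    z = pts_cam[2]
    in_front = z > 1e-6
    z_safe = np.where(in_front, z, 1.0)    # avoid dividing by zero or negative depth
    proj = K @ pts_cam
    u = proj[0] / z_safe
    v = proj[1] / z_safe
    inside = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    return float(inside.mean())


def retrieve_frames(history, K, target_world_to_cam, h, w, k=2):
    """Rank past frames by coverage of the target view and return the top k.

    history: list of (depth_map, cam_to_world) pairs, one per past frame.
    Returns a list of (frame_index, score), best first.
    """
    scores = np.array([
        visible_fraction(unproject(depth, K, pose), K, target_world_to_cam, h, w)
        for depth, pose in history
    ])
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]
```

In this sketch the geometry only selects which past frames to hand to the generator. The reprojected pixel coordinates computed inside `visible_fraction` are also the kind of dense correspondence the abstract mentions, while appearance synthesis is left entirely to the video model.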