DecMem: A Decoupled Memory Architecture for Minute-Level Consistent World Generation
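Source: arxiv.org. To address the difficulty video world models have in maintaining fine-grained spatio-temporal consistency during long-horizon reasoning, this paper proposes DecMem, a novel fine-grained, learnable, and scalable memory architecture. The study identifies two limitations of naïve learnable memory in long-range extrapolation: computational inefficiency and attention dispersion. In response, DecMem adopts a decoupled design that pairs sparse global memory with anchored local memory, giving efficient access to global history and stable, high-quality extrapolation. Experiments show that DecMem significantly outperforms existing methods and, by ensuring precise and efficient long-term memory, enables high-fidelity, highly consistent, minute-level controllable long video generation.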
Recent advances in video generative models have promoted rapid progress in controllable world models. However, maintaining fine-grained spatio-temporal consistency under long-horizon reasoning remains a key challenge. In this work, we move beyond explicit 3D memory and coarse frame-level implicit modeling, and propose a fine-grained, learnable, and scalable memory for consistent world generation. We first identify two fundamental limitations of naïve learnable memory architectures in long-horizon extrapolation, namely computational inefficiency and attention dispersion. Through a systematic analysis of attention dispersion, we propose DecMem, a decoupled memory architecture that employs Sparse Global Memory for efficient fine-grained access to global history and Anchored Local Memory for stable and high-quality extrapolation. Extensive experiments demonstrate that DecMem significantly outperforms current state-of-the-art methods. By ensuring precise and efficient long-term memory and achieving superior extrapolation capabilities, DecMem enables minute-level controllable long video generation with high fidelity and consistency.
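The abstract does not define attention dispersion formally. One plausible reading is that, as the history grows, softmax attention mass spreads across many irrelevant tokens and the weight on the one relevant token shrinks. The toy sketch below illustrates that reading only; it is not DecMem's mechanism. The function names, the scores, and the top-k selection rule are all hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weight_on_relevant(num_distractors, relevant_score=4.0,
                       distractor_score=0.0, top_k=None):
    """Attention weight placed on a single relevant memory token.

    Assumption (not from the paper): one relevant token with a high score
    competes against `num_distractors` tokens that share a lower score.
    If `top_k` is given, only the k highest-scoring tokens are kept before
    the softmax, a generic form of sparse attention.
    """
    scores = [relevant_score] + [distractor_score] * num_distractors
    if top_k is not None:
        # Keep the k largest scores; the relevant token sorts first
        # because its score is strictly the highest.
        scores = sorted(scores, reverse=True)[:top_k]
    return softmax(scores)[0]


if __name__ == "__main__":
    for n in (10, 100, 1000):
        dense = weight_on_relevant(n)
        sparse = weight_on_relevant(n, top_k=8)
        print(f"distractors={n:5d}  dense={dense:.3f}  top-8={sparse:.3f}")
```

Under these assumed scores, the dense weight on the relevant token falls from roughly 0.85 with 10 distractors to roughly 0.05 with 1000, while restricting the softmax to the top 8 scores holds it near 0.89 regardless of history length. This is the general intuition behind sparse access to a long memory; how DecMem actually selects and anchors memory is specified in the paper, not here.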