Microsoft Research's Mirage gives video generation a persistent spatial memory that doesn't forget "what's around the corner"

2026-06-14 21:58 · Jonathan Kemper

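AI summary

Mirage, a video world model developed by Microsoft Research together with several universities, stores scene information directly in latent space rather than in a pixel-based point cloud. This substantially reduces compute time and GPU memory use while keeping the scene spatially consistent across long camera moves. However, the model still cannot reliably track moving objects across clips.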



Key Points

Mirage, a new video world model from Microsoft Research and several universities, keeps the spatial structure of generated scenes consistent even during long camera movements.

Instead of taking the expensive detour through pixel-based 3D point clouds, the system stores image features directly in a spatial memory within its internal latent space.

Mirage generates videos up to 10.5x faster and uses up to 55x less memory than comparable models. Moving objects are still filtered out of the memory.

Mirage is a new video world model that skips the costly detour through pixel-based memory. That speeds up generation and keeps a scene's spatial structure stable even during long camera moves. Researchers from several universities built it with Microsoft Research.

Video world models turn a starting frame and a camera path into plausible video, which makes them useful for simulation. But without some kind of memory, even strong generators lose track of the scene's layout over time. A corner of a room the camera has already passed looks different when the camera swings back. Furniture shifts, and textures change.

Systems like Voyager, WonderWorld, and Spatia try to fix this with a 3D point cloud that gets fed a steady stream of color data. Every new generation step has to render that cloud and then translate the result back into the model's internal feature space. Microsoft's new paper calls this a double bottleneck: It eats compute, and information leaks out every time the data passes through pixel space.

Mirage takes a different approach. Rather than holding onto visible color points, it stores the internal image features the diffusion model already uses. Each feature gets a spot in 3D space, which turns it into an entry in spatial memory.
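One way to picture "each feature gets a spot in 3D space" is the sketch below. It is not Mirage's code. It assumes a pinhole camera, a per-cell depth value at latent resolution (the article does not say where the 3D positions come from), and a hypothetical function name and array layout.

```python
import numpy as np

def lift_features_to_memory(features, depth, K, cam_to_world):
    """Attach a 3D world position to every cell of a latent feature map.

    features:     (H, W, C) latent features (layout is an assumption)
    depth:        (H, W) depth per latent cell along the camera z-axis (assumed given)
    K:            (3, 3) pinhole intrinsics scaled to the latent resolution
    cam_to_world: (4, 4) camera-to-world pose
    Returns positions of shape (H*W, 3) and features of shape (H*W, C).
    """
    H, W, C = features.shape
    # Centre coordinates of every latent cell, row-major order.
    v, u = np.meshgrid(np.arange(H) + 0.5, np.arange(W) + 0.5, indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    # Back-project: ray = K^-1 [u, v, 1]^T, then scale by depth.
    rays = pix @ np.linalg.inv(K).T
    pts_cam = rays * depth.reshape(-1, 1)
    # Move the points from camera space into world space.
    pts_world = pts_cam @ cam_to_world[:3, :3].T + cam_to_world[:3, 3]
    return pts_world, features.reshape(-1, C)
```

Each returned row pairs one feature vector with one world-space point, which is the sense in which a feature becomes a memory entry. Entries from several frames could be concatenated into one store.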

To generate a new viewpoint, the model projects this store straight onto the target camera and hands the result to the generator, skipping the step of rendering a point cloud and re-encoding it. The authors say this also slashes memory use, since the data sits in the model's compact internal resolution instead of at full image size.
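The projection step can be sketched roughly as follows. This is an illustration under assumptions, not the authors' method: the function name is hypothetical, the camera is a pinhole model, and the nearest-point-wins rule for resolving overlaps is a common rendering choice that the article does not confirm for Mirage.

```python
import numpy as np

def project_memory(positions, feats, K, world_to_cam, H, W):
    """Splat stored features into an (H, W, C) grid for a target camera.

    positions:    (N, 3) world-space points of the memory entries
    feats:        (N, C) feature vector per entry
    K:            (3, 3) pinhole intrinsics at latent resolution
    world_to_cam: (4, 4) pose of the target camera
    Returns the feature grid and a boolean mask of cells that received data.
    """
    C = feats.shape[1]
    out = np.zeros((H, W, C), dtype=feats.dtype)
    mask = np.zeros((H, W), dtype=bool)
    # World space -> target camera space.
    pts_cam = positions @ world_to_cam[:3, :3].T + world_to_cam[:3, 3]
    front = pts_cam[:, 2] > 1e-6  # discard entries behind the camera
    pts_cam, f = pts_cam[front], feats[front]
    # Perspective projection onto the latent grid.
    uvw = pts_cam @ K.T
    u = np.floor(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.floor(uvw[:, 1] / uvw[:, 2]).astype(int)
    # Visit entries far-to-near so the nearest one is written last and wins.
    for i in np.argsort(-pts_cam[:, 2]):
        if 0 <= u[i] < W and 0 <= v[i] < H:
            out[v[i], u[i]] = f[i]
            mask[v[i], u[i]] = True
    return out, mask
```

The grid stays at latent resolution throughout, so nothing is rendered to pixels and re-encoded. Cells where `mask` is false are regions the memory has not seen yet, which a generator would have to fill in.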

The Decoder: AI News (RSS)
