Microsoft Research's Mirage gives video generation a persistent spatial memory that doesn't forget what's around the corner
Key Points
Mirage, a new video world model from Microsoft Research and several universities, keeps the spatial structure of generated scenes consistent even during long camera movements.
Instead of taking the expensive detour through pixel-based 3D point clouds, the system stores image features directly in a spatial memory within its internal latent space.
Mirage generates videos up to 10.5x faster and uses up to 55x less memory than comparable models. One limitation remains: moving objects are filtered out of the memory.
Mirage is a new video world model that skips the costly detour through pixel-based memory. That speeds up generation and keeps a scene's spatial structure stable even during long camera moves. Researchers from several universities built it with Microsoft Research.
Video world models turn a starting frame and a camera path into plausible moving images, which makes them useful as simulators. But without some kind of memory, even strong generators lose track of space over time. A corner of a room the camera has already passed looks different when it swings back: furniture shifts, and textures change.
Systems like Voyager, WonderWorld, and Spatia try to fix this with a 3D point cloud that gets fed a steady stream of color data. Every new generation step has to render that cloud and then translate the result back into the model's internal feature space. Microsoft's new paper calls this a double bottleneck: It eats compute, and information leaks out every time the data passes through pixel space.
Mirage takes a different approach. Rather than holding onto visible color points, it stores the internal image features the diffusion model already uses. Each feature gets a spot in 3D space, which turns it into an entry in spatial memory.
To generate a new viewpoint, the model projects this store straight onto the target camera and hands the result to the generator, skipping the step of rendering a point cloud and re-encoding it. The authors say this also slashes memory use, since the data sits in the model's compact internal resolution instead of at full image size.
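The idea of projecting stored features directly onto a target camera can be illustrated with a toy sketch. This is not the authors' implementation: the `LatentMemory` class, the `project_memory` function, the rotation-free pinhole camera, and the nearest-point depth test are all assumptions made for illustration. It only shows the general pattern of keeping feature vectors at 3D positions and splatting them onto a small latent-resolution grid, with no pixel rendering or re-encoding in between.

```python
from dataclasses import dataclass, field


@dataclass
class LatentMemory:
    """Hypothetical spatial memory: one 3D position per stored latent feature."""
    positions: list = field(default_factory=list)  # (x, y, z) world coordinates
    features: list = field(default_factory=list)   # feature vectors (lists of floats)

    def add(self, position, feature):
        self.positions.append(position)
        self.features.append(feature)


def project_memory(memory, cam_pos, focal, width, height, channels):
    """Splat stored features onto a latent-resolution grid for a target camera.

    Assumes a pinhole camera at cam_pos looking down +z with no rotation,
    and a nearest-point depth buffer. Both are simplifications.
    """
    grid = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    depth = [[float("inf")] * width for _ in range(height)]
    for (x, y, z), feat in zip(memory.positions, memory.features):
        # World to camera coordinates (translation only in this sketch).
        cx, cy, cz = x - cam_pos[0], y - cam_pos[1], z - cam_pos[2]
        if cz <= 0:
            continue  # point is behind the camera
        # Perspective projection onto the latent grid, centered on the grid.
        u = int(round(focal * cx / cz + (width - 1) / 2))
        v = int(round(focal * cy / cz + (height - 1) / 2))
        # Keep only the closest feature that lands in each grid cell.
        if 0 <= u < width and 0 <= v < height and cz < depth[v][u]:
            depth[v][u] = cz
            grid[v][u] = list(feat)
    return grid


memory = LatentMemory()
memory.add((0.0, 0.0, 2.0), [1.0, 0.5])
memory.add((0.0, 0.0, 4.0), [9.0, 9.0])  # occluded by the first entry
memory.add((1.0, 0.0, 2.0), [0.2, 0.8])

# Feature map for a camera at the origin; a generator would consume this directly.
conditioning = project_memory(memory, cam_pos=(0.0, 0.0, 0.0),
                              focal=2.0, width=5, height=5, channels=2)

# Moving the camera re-projects the same memory; nothing is re-encoded.
shifted = project_memory(memory, cam_pos=(1.0, 0.0, 0.0),
                         focal=2.0, width=5, height=5, channels=2)
```

In this sketch the grid has the size of the latent feature map rather than the full image, which is the property the authors credit for the lower memory use. A real system would also need camera rotation and some way to handle empty cells.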