# Microsoft Research's Mirage: Giving Video Generation a Persistent Spatial Memory That Doesn't Forget "What's Around the Corner"

- Source: The Decoder: AI News (RSS)
- Author: Jonathan Kemper
- Published: 2026-06-14 21:58
- AIHOT score: 45
- AIHOT link: https://aihot.virxact.com/items/cmqdv7vsb00a9slrdr7s2h12c
- Original link: https://the-decoder.com/microsoft-researchs-mirage-gives-video-generation-a-persistent-spatial-memory-that-doesnt-forget-whats-around-the-corner

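## AI Summary

Mirage, a video world model developed jointly by Microsoft Research and several universities, stores scene information directly in latent space rather than in a pixel-based point cloud. This sharply cuts compute time and GPU memory use while keeping the scene spatially consistent during long camera moves. However, the model still cannot reliably track moving objects across segments.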

## Full Text

Microsoft Research's Mirage gives video generation a persistent spatial memory that doesn't forget what's around the corner

### Key Points

- Mirage, a new video world model from Microsoft Research and several universities, keeps the spatial structure of generated scenes consistent even during long camera movements.
- Instead of taking the expensive detour through pixel-based 3D point clouds, the system stores image features directly in a spatial memory within its internal latent space.
- Mirage generates videos up to 10.5x faster and uses up to 55x less memory than comparable models. Moving objects are still filtered out of the memory.

Mirage is a new video world model that skips the costly detour through pixel-based memory. That speeds up generation and keeps a scene's spatial structure stable even during long camera moves. Researchers from several universities built it with Microsoft Research.

Video world models turn a starting frame and a camera path into plausible moving images, which makes them useful as simulators. But without some kind of memory, even strong generators lose track of space over time. A corner of a room you've already passed looks different when the camera swings back. Furniture shifts, and textures change.

Systems like Voyager, WonderWorld, and Spatia try to fix this with a 3D point cloud that gets fed a steady stream of color data. Every new generation step has to render that cloud and then translate the result back into the model's internal feature space. Microsoft's new paper calls this a double bottleneck: It eats compute, and information leaks out every time the data passes through pixel space.

Mirage takes a different approach. Rather than holding onto visible color points, it stores the internal image features the diffusion model already uses. Each feature gets a spot in 3D space, which turns it into an entry in spatial memory.

To generate a new viewpoint, the model projects this store straight onto the target camera and hands the result to the generator, skipping the step of rendering a point cloud and re-encoding it. The authors say this also slashes memory use, since the data sits in the model's compact internal resolution instead of at full image size.
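The projection step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a plain pinhole camera, nearest-cell splatting, and a simple depth test, and the name `project_memory` and all parameters are hypothetical.

```python
import numpy as np

def project_memory(points, feats, K, world_to_cam, height, width):
    """Splat stored latent features onto a target camera's latent grid.

    points: (N, 3) world-space positions; feats: (N, C) latent features.
    K: (3, 3) intrinsics at latent resolution; world_to_cam: (4, 4) extrinsics.
    Returns a (height, width, C) feature map and a (height, width) hit mask.
    """
    # Move the stored positions into the target camera's frame.
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = (world_to_cam @ homo.T).T[:, :3]

    # Discard entries behind the camera.
    in_front = cam[:, 2] > 1e-6
    cam, feats = cam[in_front], feats[in_front]

    # Pinhole projection to integer latent-grid coordinates.
    uvw = (K @ cam.T).T
    u = np.floor(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.floor(uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, depth, feats = u[inside], v[inside], cam[inside, 2], feats[inside]

    # Depth test: sort nearest first, then keep the first entry per grid cell.
    order = np.argsort(depth)
    flat = (v * width + u)[order]
    _, first = np.unique(flat, return_index=True)
    keep = order[first]

    out = np.zeros((height, width, feats.shape[1]), dtype=feats.dtype)
    mask = np.zeros((height, width), dtype=bool)
    out[v[keep], u[keep]] = feats[keep]
    mask[v[keep], u[keep]] = True
    return out, mask
```

The point of the sketch is that the output is already a feature map at the generator's internal resolution, so nothing has to be rendered to pixels and re-encoded.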

### How the memory grows with each step

Mirage builds videos in segments, seeding the spatial memory from the starting image. For every later segment, the system pulls the relevant data from memory, generates the new frames, then writes their contents back to the cache. The memory keeps growing as it goes.
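The read, generate, write cycle can be sketched as a simple loop. Everything here is a hypothetical stand-in: `SpatialMemory`, `generate_video`, and the `generate_segment` callback are illustrative names, and the real retrieval step would project entries onto the segment's cameras rather than return them all.

```python
from dataclasses import dataclass, field

@dataclass
class SpatialMemory:
    # Each entry is a (position, feature) pair; the structure is an assumption.
    entries: list = field(default_factory=list)

    def read(self, camera_path):
        # Placeholder retrieval: returns a copy of everything stored so far.
        return list(self.entries)

    def write(self, new_entries):
        self.entries.extend(new_entries)

def generate_video(start_entries, segments, generate_segment):
    """Build a video segment by segment while the memory keeps growing."""
    memory = SpatialMemory()
    memory.write(start_entries)  # seed the memory from the starting image

    video = []
    for camera_path in segments:
        context = memory.read(camera_path)            # pull relevant data
        frames, new_entries = generate_segment(context, camera_path)
        video.extend(frames)                          # emit the new frames
        memory.write(new_entries)                     # write contents back
    return video, memory
```

A toy `generate_segment` that returns one frame per camera and a single new entry is enough to watch the memory grow by one entry per segment.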

A filter keeps the system from tripping over itself by stripping out moving objects and the sky before writing, so only stable geometry lands in long-term memory. The researchers built on Alibaba's open-source video model Wan2.2, bolting on a small add-on module that teaches the model to use the new memory, then fine-tuning the whole thing with LoRA adapters.
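The write-time filter amounts to masking entries before they reach the memory. In this sketch the function name `filter_static` is hypothetical, and the two boolean masks are assumed to come from some upstream detector that the article does not specify.

```python
import numpy as np

def filter_static(points, feats, dynamic_mask, sky_mask):
    """Keep only entries that are neither moving objects nor sky.

    points: (N, 3) positions; feats: (N, C) features.
    dynamic_mask, sky_mask: (N,) boolean arrays, True where an entry
    should be excluded (assumed inputs, source not specified).
    """
    keep = ~(dynamic_mask | sky_mask)
    return points[keep], feats[keep]
```

Only the surviving entries would be written to long-term memory, which is why moving objects do not carry over from one segment to the next.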

### Faster and lighter than color-based rivals

On the WorldScore benchmark, Mirage beats its closest rival Spatia, which still keeps memory as color points, and leaves general video generators like Wan2.1 and CogVideoX far behind. It shines at holding a scene's spatial structure together and keeping surfaces looking consistent across many frames.

It also leads two of three metrics on the RealEstate10K dataset in the closed-loop test. Here the camera circles back to its starting point, a brutal stress test because every tiny error piles up over the full path.
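One way to picture a closed-loop check is to compare the first and last frames of a path that ends where it began. The sketch below is illustrative only: the article does not name the three metrics, so PSNR is used here as an assumed stand-in, and the full-turn yaw path is just one possible loop.

```python
import numpy as np

def closed_loop_yaws(num_frames):
    """Yaw angles in radians for one full turn; first and last pose coincide."""
    return np.linspace(0.0, 2.0 * np.pi, num_frames)

def psnr(a, b, max_value=1.0):
    """Peak signal-to-noise ratio between two images scaled to [0, max_value]."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    mse = np.mean((a - b) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)

def closed_loop_score(frames):
    """Compare the final frame with the starting frame of a closed path."""
    return psnr(frames[0], frames[-1])
```

Because the two frames share a camera pose, any difference between them reflects drift accumulated along the path rather than a change of viewpoint.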

Efficiency is Mirage's strongest point. Color-based memory scales badly on longer runs and keeps demanding more graphics memory. Mirage's compute cost per frame barely moves after the first segment. The researchers put the total gain at up to 10.57x faster generation and up to 55x less memory than color-based systems.
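A back-of-the-envelope count shows why storing entries at latent resolution is lighter than storing one colored point per pixel. All numbers below are made-up assumptions (frame size, 8x spatial downsampling, 16 latent channels, data types) and are not meant to reproduce the paper's reported figures.

```python
def memory_bytes(height, width, channels, bytes_per_value, frames, downsample=1):
    """Bytes for one entry per grid cell: a float32 xyz position plus features."""
    cells = (height // downsample) * (width // downsample)
    position_bytes = 3 * 4                      # xyz stored as float32
    feature_bytes = channels * bytes_per_value  # color or latent feature
    return frames * cells * (position_bytes + feature_bytes)

# Hypothetical setup: 480x832 frames, 81 frames written to memory.
pixel_store = memory_bytes(480, 832, channels=3, bytes_per_value=1, frames=81)
latent_store = memory_bytes(480, 832, channels=16, bytes_per_value=2,
                            frames=81, downsample=8)
ratio = pixel_store / latent_store  # how many times smaller the latent store is
```

Under these assumptions the latent store has 64 times fewer entries, although each entry is wider, so the byte-level saving is smaller than the entry-count saving.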

They're upfront about one catch. Moving objects get dropped at segment boundaries because their geometry can't be trusted, and the filter deliberately tosses them out. Busy scenes gain less from spatial memory than quiet interiors do. The team points to storing dynamic content as the obvious next problem to solve.

You can find more on Mirage on the project page. Microsoft also runs a GitHub repository for Latent Spatial Memory.

Video world models are one of the hottest research areas in AI video right now. Models like Veo mostly produce single, internally consistent clips, while world models try to make a scene navigable and keep it consistent over time. Google DeepMind showed this off recently with Genie 3, which spins up interactive environments in real time and holds them for several minutes. At I/O, Google also pitched Gemini Omni as a world model and the potential successor to its text-to-video model Veo.
