# MilliVid: Hierarchical Latents for Long-Range Consistency in Video Generation

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-08 08:00
- AIHOT score: 61
- AIHOT link: https://aihot.virxact.com/items/cmq7h8hbk033usl5wdv0h6g4a
- Original link: https://arxiv.org/abs/2606.09056

## AI Summary

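Long-range consistency is hard for video generation models because the Transformer sequence lengths involved become too large. MilliVid proposes coarse-to-fine generation in a multi-scale token space. A pre-trained autoencoder compresses each frame into a hierarchy of tokens, ranging from the typical latent resolution down to a handful of tokens per frame; the coarsest levels capture scene layout and semantics, and the finer levels add high-frequency appearance and texture. A video diffusion model is then trained to generate these tokens, with the level of detail at which frames are generated and used as context carefully controlled at each rollout step. This preserves long-range consistency in geometry and object permanence while reducing the compute spent on less important detail. On a dataset of long Minecraft videos, the method produces substantially more consistent videos.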

## Body

Video generative models have become increasingly powerful, but long-range consistency remains challenging to achieve because even a few dozen frames require impractically long transformer sequence lengths. We show that this issue can be mitigated by generating video using coarse-to-fine rollout within a multi-scale token space. Our approach is simple: first, we pre-train an autoencoder that compresses each frame into a hierarchy of tokens, with levels ranging from the typical latent resolution to only a handful of tokens per frame. The coarsest levels capture the most consequential information, such as scene layout and semantics, while finer levels add high-frequency appearance and texture. Then, we train a video diffusion model to generate these tokens using coarse-to-fine rollout. By carefully controlling the level of detail at which frames are generated and used as context during each rollout step, we are able to preserve long-range consistency in geometry and object permanence while spending less compute on the long-range consistency of less perceptually relevant details. We validate this approach using a custom dataset of long Minecraft videos, where it produces substantially more consistent rollouts compared to existing baselines.
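To make the sequence-length argument concrete, the following is a minimal sketch of the token arithmetic only, not the paper's implementation. The abstract does not give the number of levels, the tokens per level, or how detail is assigned to context frames, so `LEVEL_TOKENS`, `level_for_distance`, and the distance thresholds are all hypothetical values chosen for illustration.

```python
# Hypothetical tokens per frame at each hierarchy level, coarsest first.
# The abstract only says levels range from "the typical latent resolution"
# down to "a handful of tokens per frame"; these numbers are assumptions.
LEVEL_TOKENS = [4, 16, 64, 256]


def level_for_distance(distance: int) -> int:
    """Return the level index used for a context frame `distance` frames back.

    Assumed schedule: nearby frames keep full detail, distant frames are
    represented only at coarse levels. The thresholds are illustrative.
    """
    if distance <= 2:
        return 3  # finest level
    if distance <= 8:
        return 2
    if distance <= 32:
        return 1
    return 0  # coarsest level


def multiscale_context_length(num_context_frames: int) -> int:
    """Total context tokens when detail decreases with temporal distance."""
    return sum(
        LEVEL_TOKENS[level_for_distance(d)]
        for d in range(1, num_context_frames + 1)
    )


def full_resolution_context_length(num_context_frames: int) -> int:
    """Total context tokens when every frame is kept at the finest level."""
    return num_context_frames * LEVEL_TOKENS[-1]


if __name__ == "__main__":
    frames = 64
    print(multiscale_context_length(frames))
    print(full_resolution_context_length(frames))
```

With these made-up numbers, 64 context frames cost 1,408 tokens under the distance-based schedule versus 16,384 at full resolution. The ratio depends entirely on the assumed schedule; the point is only that representing distant frames with a handful of tokens keeps the sequence short while still covering a long horizon.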
