# Vera: A Layered Diffusion Model for Content-Preserving Video Editing

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-22 08:00
- AIHOT score: 44
- AIHOT link: https://aihot.virxact.com/items/cmqqu3riv0dfislp5l35kny4r
- Original link: https://arxiv.org/abs/2606.23610

## AI Summary

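Vera is a layered diffusion model designed for content-preserving video editing. It generates an edit layer together with an alpha matte and composites them with the source video, which separates creative editing from content preservation. The architecture is a Mixture-of-Transformers (MoT): each layer has its own DiT, and the DiTs interact through joint self-attention. Training uses a high-quality layered dataset with accurate alpha mattes and diverse scenes. On the quantitative benchmark and in the human preference study, Vera outperforms open-source models in content preservation and remains competitive in edit quality, using only 486K frames of layered training data.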

## Main Text

Video diffusion models have enabled remarkable progress in video generation and editing. However, content preservation remains a core challenge: existing methods regenerate every pixel and often alter elements that should remain unchanged, such as characters or background scenes. We introduce Vera, a layered diffusion framework for content-preserving video editing. Instead of regenerating the entire video, Vera generates an edit layer along with an alpha matte for compositing with the source video, separating creative editing from content preservation by design. To encourage coherent composition with the source video, we extend the text-to-video DiT into a Mixture-of-Transformers (MoT) architecture, with separate DiTs for each layer that interact through joint self-attention. To support the training of Vera, we further construct a high-quality layered dataset with accurate alpha mattes, diverse scenes and dynamics, and visual effects. Across our quantitative benchmark and human preference study, Vera outperforms leading open-source video editing models in content preservation while remaining competitive in edit quality, using 486K frames of layered training data.
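The two mechanisms named in the abstract, alpha compositing of an edit layer over the source and joint self-attention across per-layer transformers, can be illustrated with a minimal NumPy sketch. Everything below is an assumption made for illustration, not the authors' implementation: the abstract does not state the compositing formula (the standard "over" operator is assumed here), whether compositing happens in pixel or latent space, or how attention is parameterized. The function names `composite` and `joint_self_attention`, the single-head attention, and the tensor shapes are all hypothetical.

```python
import numpy as np


def composite(source, edit_rgb, alpha):
    """Blend an edit layer over the source video (assumed "over" operator).

    source, edit_rgb: float arrays of shape (T, H, W, 3) with values in [0, 1].
    alpha:            float array of shape (T, H, W, 1) with values in [0, 1].
    Pixels where alpha == 0 are copied from the source unchanged.
    """
    alpha = np.clip(alpha, 0.0, 1.0)
    return alpha * edit_rgb + (1.0 - alpha) * source


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_self_attention(tokens_per_layer, weights_per_layer):
    """Single-head joint self-attention over several layers' tokens.

    Each layer projects its own tokens with its own (Wq, Wk, Wv) matrices,
    a stand-in for "separate DiTs for each layer". Attention is then computed
    over the concatenated sequence, so every token can attend to every layer.

    tokens_per_layer:  list of arrays, each of shape (N_i, D).
    weights_per_layer: list of (Wq, Wk, Wv) tuples, each matrix of shape (D, D).
    Returns a list of arrays with the same shapes as the inputs.
    """
    pairs = list(zip(tokens_per_layer, weights_per_layer))
    q = np.concatenate([x @ w[0] for x, w in pairs], axis=0)
    k = np.concatenate([x @ w[1] for x, w in pairs], axis=0)
    v = np.concatenate([x @ w[2] for x, w in pairs], axis=0)

    scores = q @ k.T / np.sqrt(q.shape[-1])
    out = softmax(scores, axis=-1) @ v

    # Split the joint output back into one chunk per layer.
    boundaries = np.cumsum([x.shape[0] for x in tokens_per_layer])[:-1]
    return np.split(out, boundaries, axis=0)
```

In this sketch, content preservation follows directly from the compositing step: wherever the matte is zero the output equals the source exactly, so only the region covered by the edit layer can change.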
