# PermaVid: Edit-Consistent Video Generation via Disentangled Context Memory

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-15 08:00
- AIHOT score: 53
- AIHOT link: https://aihot.virxact.com/items/cmqg9fubc00mzsl9sb508g78d
- Original link: https://arxiv.org/abs/2606.16449

## AI Summary

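PermaVid proposes a multi-modal context memory framework that disentangles spatial context into semantic appearance and geometric structure, stored in an RGB context memory and a depth context memory respectively. An edit-aware memory update and retrieval strategy keeps the evolution of memory aligned with subsequent observations. After an edit changes the scene's appearance or layout, the framework still maintains long-term semantic and structural consistency of the generated video across time and viewpoints, significantly outperforming existing methods.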

## Body

Consistent video generation under editing operations requires persistence: when edits modify scene appearance or layout, subsequent generations should remain coherent across time and viewpoints. However, existing memory designs struggle to maintain long-term consistency after such modifications, as stored contexts may become outdated or invalid. To address this, we propose PermaVid, a novel framework built upon a multi-modal context memory that disentangles spatial context into semantic appearance and geometric structure, together with an edit-aware memory update and retrieval strategy that keeps memory evolution aligned with subsequent observations. Specifically, we develop two complementary memory banks: an RGB context memory that captures appearance-aware observations while implicitly encoding geometry, and a depth context memory that preserves geometry-only structure disentangled from semantics. Building on this design, we introduce a memory-guided video generation model that performs multi-modal feature fusion under reference conditions drawn from mixed-modality memory contexts. Experiments demonstrate that our method maintains strong long-term semantic and structural consistency after edits, significantly outperforming state-of-the-art methods.
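The two-bank design and the edit-aware update can be pictured with a toy data structure. The sketch below is purely illustrative and is not the paper's implementation: the names `Entry`, `ContextMemory`, `apply_edit`, and `retrieve` are hypothetical, payloads are placeholder strings rather than frames or depth maps, a single yaw angle stands in for a full camera pose, and the invalidation rule (an appearance edit discards stored RGB context but keeps depth; a layout edit discards both) is an assumption chosen only to show why separating appearance from geometry can be useful.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Literal

EditKind = Literal["appearance", "layout"]


@dataclass
class Entry:
    step: int          # generation step at which the observation was stored
    view_angle: float  # camera yaw in degrees (stand-in for a full pose)
    payload: str       # placeholder for an RGB frame or a depth map


def angular_distance(a: float, b: float) -> float:
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


@dataclass
class ContextMemory:
    """Toy two-bank memory: one bank for appearance, one for geometry."""
    rgb: List[Entry] = field(default_factory=list)
    depth: List[Entry] = field(default_factory=list)

    def write(self, step: int, view_angle: float,
              rgb_payload: str, depth_payload: str) -> None:
        # Store the same observation in both banks.
        self.rgb.append(Entry(step, view_angle, rgb_payload))
        self.depth.append(Entry(step, view_angle, depth_payload))

    def apply_edit(self, kind: EditKind) -> None:
        # ASSUMED rule, not taken from the paper:
        # any edit makes stored appearance stale; only a layout edit
        # also makes stored geometry stale.
        self.rgb.clear()
        if kind == "layout":
            self.depth.clear()

    def retrieve(self, view_angle: float, k: int = 2) -> Dict[str, List[Entry]]:
        # Return up to k entries per bank, nearest viewpoint first.
        def nearest(bank: List[Entry]) -> List[Entry]:
            ranked = sorted(
                bank, key=lambda e: angular_distance(e.view_angle, view_angle)
            )
            return ranked[:k]

        return {"rgb": nearest(self.rgb), "depth": nearest(self.depth)}


# Usage: observe two views, recolor the scene, observe one more view.
memory = ContextMemory()
memory.write(0, 0.0, "rgb0", "depth0")
memory.write(1, 90.0, "rgb1", "depth1")
memory.apply_edit("appearance")
memory.write(2, 10.0, "rgb2", "depth2")
references = memory.retrieve(view_angle=8.0, k=2)
```

In this toy setup, the mixed-modality reference set after an appearance edit contains only post-edit RGB context while still drawing on pre-edit depth context. A real system would presumably invalidate memory regionally rather than globally and would feed the retrieved references into the generator's feature fusion, neither of which is modeled here.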