# LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-01 08:00
- AIHOT score: 68
- AIHOT link: https://aihot.virxact.com/items/cmpw9qhhb00z4slsnv8a0mzlx
- Original paper: https://arxiv.org/abs/2606.02553

## AI Summary

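LongLive-RAG targets the error accumulation and identity drift that autoregressive (AR) video diffusion models face in long video generation. The method formulates long video generation as a retrieval-augmented generation (RAG) problem: rather than relying solely on a sliding window, it treats previously generated latents as a dynamic, searchable history. At each new block, it uses a query embedding to retrieve relevant historical latents, so the generator can draw on non-local context. To improve retrieval, the framework introduces the Window Temporal Delta Loss. Experiments show that the framework improves long-video quality and achieves the best average rank on the VBench-Long benchmark across multiple AR backbones and generation lengths. The code is open source.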

## Main Text

Autoregressive (AR) video diffusion enables variable-length synthesis, but long-horizon generation often suffers from accumulated errors and identity drift. For efficiency, existing methods commonly adopt sliding-window attention during generation. This creates an irreversible generation trajectory: once the active window accumulates appearance errors, subsequent generations can only condition on this degraded trajectory and drift further away. We address this limitation by formulating long video generation as a retrieval-augmented generation (RAG) problem. Rather than relying solely on the recent window, we treat previously generated latents as a dynamic, searchable history. We propose LongLive-RAG, a general retrieval framework for AR video generation. At each new block, LongLive-RAG uses a query embedding to retrieve relevant historical latents. This lightweight retrieval step adds only a small overhead relative to generation and lets the generator condition on non-local context instead of only the recent window. To make retrieval more discriminative, we introduce the Window Temporal Delta Loss that suppresses redundant local similarity and encourages embeddings to capture meaningful temporal changes. Together, these components help reduce error accumulation caused by sliding-window attention. Experiments across multiple AR backbones and generation lengths show improved long-video quality and the best average VBench-Long rank. To our knowledge, among open-ended AR long video generation methods, LongLive-RAG is the first to formulate self-generated latent history as content-addressable retrieval memory. Code is available at https://github.com/qixinhu11/LongLive-RAG.
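The retrieval step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how query embeddings are computed, how many latents are retrieved, or how they are fed to the generator. The names `LatentMemory`, `window`, `top_k`, and `cosine` are hypothetical, and cosine similarity is an assumed scoring function. The sketch only shows the general idea of combining the recent sliding window with content-addressed lookups into older blocks.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed scoring rule)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class LatentMemory:
    """Hypothetical store of per-block embeddings (keys) and latents."""
    window: int = 2   # number of most recent blocks always kept as context
    top_k: int = 2    # number of older blocks retrieved by similarity
    keys: list = field(default_factory=list)
    latents: list = field(default_factory=list)

    def add(self, key, latent):
        """Append a finished block's embedding and latent to the history."""
        self.keys.append(list(key))
        self.latents.append(latent)

    def retrieve(self, query):
        """Return indices of the top_k blocks outside the recent window."""
        cutoff = max(0, len(self.keys) - self.window)
        scored = [(cosine(query, self.keys[i]), i) for i in range(cutoff)]
        # Highest similarity first; ties broken by the earlier block index.
        scored.sort(key=lambda s: (-s[0], s[1]))
        return [i for _, i in scored[: self.top_k]]

    def context(self, query):
        """Block indices to condition on: retrieved history plus recent window."""
        start = max(0, len(self.keys) - self.window)
        recent = list(range(start, len(self.keys)))
        return sorted(self.retrieve(query)) + recent


memory = LatentMemory(window=2, top_k=1)
for key in ([1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [0.0, 1.0], [0.1, 1.0]):
    memory.add(key, latent=None)  # latents omitted in this toy example

print(memory.context([1.0, 0.1]))
```

In this toy run the last two blocks form the recent window, and the query is closest to block 0, so the generator would condition on blocks 0, 3, and 4. Restricting retrieval to blocks outside the window is a design choice of this sketch, made so that retrieved context adds non-local information rather than duplicating what the window already provides.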
