# MemLearner: Learning to Query Context Memory for Video World Models

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-30 08:00
- AIHOT score: 38
- AIHOT link: https://aihot.virxact.com/items/cmr1imjwa036pslnlpukzs0y0
- Original link: https://arxiv.org/abs/2606.31734

## AI Summary

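Video world models lack memory during long-horizon generation, which leads to inconsistent scenes. MemLearner proposes a learning-based adaptive context query method that uses query tokens to bridge context and predicted tokens, and performs context querying with the video generation model's own pre-trained visual priors, so no additional module has to be trained from scratch. The team collected a dataset of long videos with scene occlusions and dynamic objects, paired with camera pose annotations, and adopted a multi-dataset training strategy that uses both annotated rendered videos and unannotated real-world videos. Experiments show that MemLearner significantly outperforms prior video world models in scene consistency and memory, especially under occlusion and in dynamic scenes.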

## Main Text

Video World Models are interactive video generation models that predict future world states based on user actions and history video frames. A critical challenge in video world models is the lack of memory, causing inconsistent generated scenes over extended durations. Previous methods explored rule-based context frame retrieval as memory, but they fail to generalize in scenarios with scene occlusions and dynamic objects. We propose MemLearner, a learning-based adaptive context query method using query tokens to bridge context and predicted tokens. By leveraging the video generation model itself for context querying, MemLearner exploits pre-trained visual priors without training additional modules from scratch, and incorporates efficient strategies for training and inference. We collect a dataset of long videos with scene occlusions and dynamic objects, paired with camera pose annotations, and propose a multi-dataset training strategy leveraging both annotated rendered and unannotated real-world videos. Extensive experiments demonstrate that MemLearner significantly outperforms prior video world models in terms of scene consistency and memory, particularly under challenging occlusion and dynamic scenarios.
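The abstract does not describe the architecture, so the following is only a minimal sketch of the general idea of query tokens reading from a context, assuming plain scaled dot-product cross-attention. All names (`query_context`, `random_tokens`, the token counts and dimension) are hypothetical, the tokens are random stand-ins, and MemLearner itself performs the querying with the video generation model rather than with a standalone attention function like this one.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def query_context(query_tokens, context_tokens):
    """Scaled dot-product cross-attention from query tokens to context tokens.

    Returns one memory token per query token, plus the attention weights.
    """
    dim = len(context_tokens[0])
    memory, all_weights = [], []
    for q in query_tokens:
        weights = softmax([dot(q, c) / math.sqrt(dim) for c in context_tokens])
        # Each memory token is a weighted average of the context tokens.
        memory.append(
            [sum(w * c[i] for w, c in zip(weights, context_tokens)) for i in range(dim)]
        )
        all_weights.append(weights)
    return memory, all_weights


random.seed(0)
DIM, NUM_QUERIES, NUM_CONTEXT, NUM_PREDICTED = 8, 4, 32, 6


def random_tokens(n):
    return [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(n)]


query_tokens = random_tokens(NUM_QUERIES)        # stand-ins for learned query tokens
context_tokens = random_tokens(NUM_CONTEXT)      # stand-ins for tokens of past frames
predicted_tokens = random_tokens(NUM_PREDICTED)  # stand-ins for tokens being generated

memory_tokens, attn = query_context(query_tokens, context_tokens)

# In this sketch the predictor is conditioned on a short memory summary
# (4 tokens) instead of the full context (32 tokens).
conditioning = memory_tokens + predicted_tokens
print(len(conditioning), len(conditioning[0]))
```

In this toy setup the attention weights are computed from the token contents, so which context tokens are read depends on the data rather than on a fixed retrieval rule. That is the contrast with rule-based context frame retrieval that the abstract draws, though the actual mechanism in the paper may differ.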