WorldKV: Efficient World Memory via World Retrieval and Compression
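Source: arxiv.org
To address the memory and compute bottlenecks that autoregressive video diffusion models face in maintaining persistent world consistency, this paper proposes WorldKV, a training-free framework with two core components: World Retrieval and World Compression. World Retrieval uses camera and action correspondence to recall historical KV-cache chunks on demand and insert them into the current window. World Compression prunes tokens by key-key similarity to an anchor frame, halving cache storage. Experiments show that WorldKV roughly doubles throughput while preserving generation fidelity, and is competitive with memory methods that require training.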
Autoregressive video diffusion models have enabled real-time, action-conditioned world generation. However, sustaining a persistent world, where revisiting a previously seen viewpoint yields consistent content, remains an open problem. Full KV-cache attention preserves this consistency but breaks real-time constraints: memory footprint and attention cost grow linearly with rollout length. Sliding window inference restores throughput but discards long-term consistency. We propose WorldKV, a training-free framework with two components: World Retrieval and World Compression. World Retrieval stores evicted KV-cache chunks in GPU/CPU memory and selectively retrieves scene-relevant chunks via camera/action correspondence, inserting them back into the native attention window without re-encoding. World Compression prunes redundant tokens within each chunk via key-key similarity to an anchor frame, halving per-chunk storage to fit 2x more history under a fixed budget. On Matrix-Game-2.0 and LingBot-World-Fast, WorldKV matches or exceeds full-KV memory fidelity at roughly 2x the throughput, and is competitive with memory-trained baselines without any fine-tuning. Project Page: https://cvlab-kaist.github.io/WorldKV/
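As a rough illustration of the two components, the following toy sketch stores evicted chunks after pruning half of their tokens and later retrieves the chunks closest to a query camera pose. It is not the authors' implementation. The abstract does not specify the scoring rule, so the choices here are assumptions: redundancy is taken as the maximum cosine similarity between a token's key and any anchor-frame key, and "camera/action correspondence" is approximated by a planar pose distance. All names (`compress_chunk`, `pose_distance`, `ChunkStore`) are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def compress_chunk(keys, values, anchor_keys, keep_ratio=0.5):
    """Drop the tokens whose keys are most similar to the anchor frame's keys.

    Assumption: redundancy = max cosine similarity to any anchor key.
    Returns the kept keys, kept values, and kept token indices (in order).
    """
    redundancy = [max(cosine(k, a) for a in anchor_keys) for k in keys]
    n_keep = max(1, int(len(keys) * keep_ratio))
    # Sort token indices from least to most redundant, keep the first n_keep.
    order = sorted(range(len(keys)), key=lambda i: redundancy[i])
    kept = sorted(order[:n_keep])
    return [keys[i] for i in kept], [values[i] for i in kept], kept


def pose_distance(p, q, yaw_weight=1.0):
    """Hypothetical correspondence score: planar distance plus yaw difference.

    Poses are (x, y, yaw) tuples with yaw in radians.
    """
    (px, py, pyaw), (qx, qy, qyaw) = p, q
    # Wrap the yaw difference into [-pi, pi) before taking its magnitude.
    d_yaw = abs((pyaw - qyaw + math.pi) % (2 * math.pi) - math.pi)
    return math.hypot(px - qx, py - qy) + yaw_weight * d_yaw


class ChunkStore:
    """Holds compressed KV chunks evicted from the sliding attention window."""

    def __init__(self):
        self.chunks = []  # list of (pose, keys, values)

    def evict(self, pose, keys, values, anchor_keys):
        """Compress a chunk leaving the window and store it with its pose."""
        k, v, _ = compress_chunk(keys, values, anchor_keys)
        self.chunks.append((pose, k, v))

    def retrieve(self, query_pose, top_k=2):
        """Return the top_k stored chunks closest to the query pose."""
        ranked = sorted(self.chunks,
                        key=lambda c: pose_distance(c[0], query_pose))
        return ranked[:top_k]
```

In an actual model the retrieved keys and values would be concatenated into the attention window's KV cache rather than returned as Python lists, and the keys would be per-layer, per-head tensors. With `keep_ratio=0.5`, each stored chunk takes half the space, which corresponds to the abstract's claim of fitting 2x more history under a fixed budget.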