UnityShots: A Memory-Driven Multi-Shot Audio-Video Generation System
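Source: arxiv.org. UnityShots is built on LTX-2.3 and uses a memory-driven design for multi-shot audio-video generation. The video stream maintains two fixed-size memory slots: long-term memory (LTM) anchors the opening shot, and short-term memory (STM) holds the tail of the preceding segment. Both are updated at every cut by a boundary-conditioned gate that fuses visual cut probability with beat-tracker signals. The audio stream injects a reference speaker token at every shot to preserve vocal timbre. A discrete cut-type prior is learned through AdaLN, so transition strength can be adjusted at inference time. The team releases a benchmark of 200 multi-cultural multi-shot sequences covering six ethnic regions and 10+ languages, with per-shot reference identities, reference audio, and boundary labels. Under I2V, T2V, and R2V conditioning, UnityShots leads open-source baselines on all cross-shot coherence metrics and matches the strongest closed-source system on the multi-shot axes.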
Generating a coherent multi-shot video requires structured cross-shot memory. Subject appearance, scene context, and speaker identity must persist across cuts. Existing approaches train end-to-end over fixed-length sequences and cannot scale, generate shot-by-shot with memory banks that grow linearly, or orchestrate pretrained generators under an LLM planner without a multi-shot-aware backbone. We present UnityShots, a memory-driven multi-shot audio-video generation system built on LTX-2.3 and trained on annotated cinematic and music-video shots. The video stream maintains two fixed-size slots, a long-term memory (LTM) slot anchored to the opening shot and a short-term memory (STM) slot holding the immediately preceding tail, both updated at every cut by a boundary-conditioned gate that fuses visual cut probability and beat-tracker signals. The audio stream injects a reference speaker token at every shot to preserve vocal timbre without a sliding audio bank. A discrete cut-type prior, learned through AdaLN, becomes an inference-time control knob over transition strength. We release a benchmark of 200 multi-cultural multi-shot sequences spanning six ethnic regions and ten or more languages, with per-shot reference identities, reference audio, and per-boundary transition labels. Evaluated across I2V, T2V, and R2V conditioning modes, UnityShots leads open-source baselines on every cross-shot coherence metric and matches the strongest closed-source system on the multi-shot axes.
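To make the two-slot memory idea concrete, here is a minimal, illustrative sketch in plain Python. It is not the authors' implementation: the abstract does not give the gate's functional form or the slot update rule, so the sigmoid fusion, its weights, the `ltm_rate` parameter, the `ShotMemory` class, and the use of scalar per-frame features are all assumptions chosen for illustration. The only properties taken from the abstract are that both slots have a fixed size, that the LTM is tied to the opening shot, that the STM tracks the preceding tail, and that a gate fusing cut probability with a beat signal drives the update at each cut.

```python
import math
from dataclasses import dataclass, field
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def blend(old: List[float], new: List[float], weight: float) -> List[float]:
    """Convex combination of two equal-length feature lists."""
    return [(1.0 - weight) * o + weight * n for o, n in zip(old, new)]


@dataclass
class ShotMemory:
    """Hypothetical two-slot memory; every slot always holds `slot_size` values."""
    slot_size: int
    ltm_rate: float = 0.1  # assumed: the LTM moves far more slowly than the STM
    ltm: List[float] = field(default_factory=list)
    stm: List[float] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.slot_size <= 0:
            raise ValueError("slot_size must be positive")

    def gate(self, cut_prob: float, beat_strength: float) -> float:
        # Assumed fusion: a fixed-weight logistic over the two boundary signals.
        return sigmoid(4.0 * cut_prob + 2.0 * beat_strength - 3.0)

    def update(self, shot: List[float], cut_prob: float, beat_strength: float) -> float:
        """Update both slots at a cut and return the gate value that was applied."""
        if len(shot) < self.slot_size:
            raise ValueError("shot is shorter than the memory slot")
        head, tail = shot[: self.slot_size], shot[-self.slot_size:]
        if not self.ltm:
            # Opening shot: the LTM is anchored here and the STM takes its tail.
            self.ltm, self.stm = list(head), list(tail)
            return 1.0
        g = self.gate(cut_prob, beat_strength)
        self.stm = blend(self.stm, tail, g)                  # follows the latest tail
        self.ltm = blend(self.ltm, head, g * self.ltm_rate)  # stays near the opening shot
        return g
```

Because each update overwrites the slots in place rather than appending to them, memory use stays constant in the number of shots, in contrast to the linearly growing memory banks the abstract attributes to shot-by-shot approaches. A hard cut that lands on a beat yields a gate near 1 and largely replaces the STM, while a weak boundary signal leaves both slots close to their previous state.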