Prompt Relay: Inference-Time Temporal Control for Multi-Event Video Generation
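Source: arxiv.org
Video diffusion models suffer from insufficient temporal control and semantic entanglement when generating multi-event videos. Prompt Relay is an inference-time, plug-and-play remedy that requires no architectural modifications and adds no computational overhead. It introduces a penalty into the cross-attention mechanism that forces each temporal segment to attend only to its assigned prompt, so the model presents one semantic concept at a time. The method markedly improves temporal prompt alignment, reduces interference between concepts, and improves the visual quality and narrative coherence of the generated videos.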
Video diffusion models have achieved remarkable progress in generating high-quality videos. However, these models struggle to represent the temporal succession of multiple events in real-world videos and lack explicit mechanisms to control when semantic concepts appear, how long they persist, and the order in which multiple events occur. Such control is especially important for movie-grade video synthesis, where coherent storytelling depends on precise timing, duration, and transitions between events. When using a single paragraph-style prompt to describe a sequence of complex events, models often exhibit semantic entanglement, where concepts intended for different moments in the video bleed into one another, resulting in poor text-video alignment. To address these limitations, we propose Prompt Relay, an inference-time, plug-and-play method to enable fine-grained temporal control in multi-event video generation, requiring no architectural modifications and no additional computational overhead. Prompt Relay introduces a penalty into the cross-attention mechanism, so that each temporal segment attends only to its assigned prompt, allowing the model to represent one semantic concept at a time and thereby improving temporal prompt alignment, reducing semantic interference, and enhancing visual quality.
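The abstract does not give the exact form of the penalty, so the following is only a minimal NumPy sketch of the general idea, under stated assumptions: the penalty is modeled as an additive negative bias on the cross-attention logits, video tokens are grouped by frame, and each prompt is assigned a half-open frame range. The names `relay_bias`, `cross_attention`, `segments`, and `penalty` are hypothetical and do not come from the paper.

```python
import numpy as np

def relay_bias(frame_of_query, segment_of_key, segments, penalty=1e4):
    """Additive logit bias (assumed form of the penalty).

    Returns 0 where a video token's frame lies inside the frame range of the
    prompt that owns the text token, and -penalty everywhere else.
    """
    starts = np.array([s for s, _ in segments])
    ends = np.array([e for _, e in segments])
    f = frame_of_query[:, None]      # (num_video_tokens, 1)
    k = segment_of_key[None, :]      # (1, num_text_tokens)
    inside = (f >= starts[k]) & (f < ends[k])
    return np.where(inside, 0.0, -penalty)

def cross_attention(q, k, v, bias):
    """Scaled dot-product cross-attention with an additive bias on the logits."""
    logits = q @ k.T / np.sqrt(q.shape[-1]) + bias
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy setup: 4 frames with 2 video tokens each, two prompts with 3 text tokens each.
segments = [(0, 2), (2, 4)]                     # prompt 0 -> frames 0-1, prompt 1 -> frames 2-3
frame_of_query = np.repeat(np.arange(4), 2)     # frame index of each video token
segment_of_key = np.repeat(np.arange(2), 3)     # owning prompt of each text token

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 16))   # video-token queries
k = rng.normal(size=(6, 16))   # text-token keys
v = rng.normal(size=(6, 16))   # text-token values

bias = relay_bias(frame_of_query, segment_of_key, segments)
out, weights = cross_attention(q, k, v, bias)
```

In this toy setup, video tokens from frames 0 and 1 place essentially all of their attention on the first prompt's text tokens, and tokens from frames 2 and 3 on the second prompt's. A smaller `penalty` would give a softer restriction; whether the method uses a hard or soft penalty, and how it treats segment boundaries, is not stated in the abstract.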