# RhymeFlow: Training-Free Video Generation Acceleration via Asynchronous Denoising Flow Scheduling

- Source: HuggingFace Daily Papers (trending community papers)
- Published: 2026-06-04 08:00
- AIHOT score: 40
- AIHOT link: https://aihot.virxact.com/items/cmqeoxx5004awsluniuurptbg
- Original link: https://arxiv.org/abs/2606.06309

## AI Summary

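Video generation models based on Diffusion Transformers (DiTs) suffer from high inference latency because of the quadratic complexity of 3D attention. Existing acceleration methods reduce computation within each denoising step but still require every frame to undergo complete, dense denoising. RhymeFlow proposes a training-free framework that decouples the denoising trajectories of different frames: only a sparse set of keyframes undergoes dense, step-by-step denoising to preserve structural integrity, while non-keyframes progressively skip steps to reduce computation. It also introduces a latent trajectory projection module that lets keyframes interact with a complete, temporally consistent sequence representation, avoiding visual degradation. On existing DiT-based video generation models, RhymeFlow achieves higher inference speed and better visual quality.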

## Main Text

Video generation models based on Diffusion Transformers (DiTs) have achieved remarkable performance in video synthesis, yet they suffer from high inference latency and computational cost due to the quadratic complexity of 3D attention. Existing acceleration methods primarily reduce the computation within each individual denoising step through techniques such as sparse attention and KV-caching. However, they rigidly adhere to an inherent constraint of the standard diffusion pipeline: every frame in the target video sequence must undergo a complete, dense denoising process across all diffusion timesteps. We observe that, because adjacent frames share correlated content and motion, once keyframes with critical semantic transitions are anchored, the intermediate states of the other frames often follow more predictable trajectories. This indicates that such a uniform, dense denoising process is inherently redundant for natural video data. To this end, we introduce RhymeFlow, a training-free framework that decouples the denoising trajectories of different frames. Specifically, we first identify a sparse set of pivotal keyframes that dominate the latent semantic evolution. Then, only these keyframes undergo dense, step-by-step denoising to ensure structural integrity, while non-keyframes progressively skip denoising steps to minimize computational cost. Because the skipped intermediate states of non-keyframes break temporal coherence during the keyframe denoising steps and lead to visual degradation, we further introduce a latent trajectory projection module, which enables keyframes to interact with a complete and temporally consistent sequence representation. Extensive experiments on current DiT-based video generation models demonstrate that our method outperforms existing baselines with higher inference speed and better visual quality.
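
The decoupled schedule described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how keyframes are chosen or how the skipping progresses, so the uniform keyframe picker (`select_keyframes`), the linearly growing skip interval (`max_skip`), and the rule that every frame is processed at the final step are all assumptions made purely for illustration. The latent trajectory projection module is not modeled here.

```python
from typing import List


def select_keyframes(num_frames: int, stride: int) -> List[int]:
    """Hypothetical keyframe picker: every `stride`-th frame plus the last frame.

    The paper selects keyframes by their role in latent semantic evolution;
    uniform spacing is only a stand-in for this sketch.
    """
    keys = list(range(0, num_frames, stride))
    if keys[-1] != num_frames - 1:
        keys.append(num_frames - 1)
    return keys


def build_schedule(
    num_frames: int, num_steps: int, keyframes: List[int], max_skip: int
) -> List[List[bool]]:
    """Return schedule[step][frame] == True when that frame is denoised at that step.

    Assumptions (not from the paper):
      * keyframes are denoised at every step;
      * non-keyframes are denoised once every `interval` steps, where the
        interval grows linearly from 1 to `max_skip` over the run;
      * all frames are denoised at the final step.
    """
    key_set = set(keyframes)
    schedule = []
    for step in range(num_steps):
        interval = 1 + (step * (max_skip - 1)) // max(num_steps - 1, 1)
        is_last = step == num_steps - 1
        row = [
            (f in key_set) or is_last or (step % interval == 0)
            for f in range(num_frames)
        ]
        schedule.append(row)
    return schedule


def dense_fraction(schedule: List[List[bool]]) -> float:
    """Fraction of (step, frame) pairs that are still denoised."""
    total = sum(len(row) for row in schedule)
    active = sum(sum(row) for row in schedule)
    return active / total


keyframes = select_keyframes(num_frames=9, stride=4)
schedule = build_schedule(num_frames=9, num_steps=8, keyframes=keyframes, max_skip=4)
print(keyframes)
print(f"{dense_fraction(schedule):.3f}")
```

In this toy setting the three keyframes are processed at all eight steps while the six remaining frames are processed at six of them, so roughly a sixth of the per-frame denoising work is skipped. Because 3D attention cost grows quadratically with the number of tokens attended to, the real savings would depend on how the skipped frames participate in attention, which this sketch does not capture.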
