# FlowLong: Inference-Time Long Video Generation via Manifold-Constrained Tweedie Matching

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-20 08:00
- AIHOT score: 68
- AIHOT link: https://aihot.virxact.com/items/cmpgcijl40e9hsljwyqjrqjgy
- Original link: https://arxiv.org/abs/2605.20910

## AI Summary

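To address the quality degradation and repetitive motion that video diffusion models exhibit when generating long sequences, the study proposes FlowLong, a training-free inference-time method. It generates long videos with overlapping sliding windows and uses Tweedie matching to blend the predicted samples of adjacent windows, preserving temporal continuity. In the high-noise phase it applies stochastic early-phase sampling to synchronize trajectories, then switches to deterministic ODE sampling to preserve visual quality. Experiments show that the method generates videos several times the original length across a range of models, outperforms existing baselines in temporal consistency and visual quality, and extends to audio-video generation and 3DGS tasks.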

## Full Text

Extending the generation horizon of video diffusion models to long sequences remains a long-standing and important challenge. Existing training-free approaches fall into two categories: extensions of bidirectional models, which are tightly coupled to specific architectures and suffer from quality degradation over long horizons, and autoregressive models, which accumulate drift errors due to exposure bias and tend to produce repetitive motion patterns. To address these issues, we propose a novel but simple inference-time approach for long video generation that is architecture-agnostic and requires no additional training. Our method generates long videos via overlapping sliding windows, where predicted clean samples from adjacent windows are blended via Tweedie matching to enforce both manifold constraint and temporal consistency across overlap regions. Stochastic early-phase sampling then synchronizes per-window trajectories by injecting fresh noise after each Tweedie matching correction in the high-noise phase, before transitioning to deterministic ODE sampling to preserve fine-grained visual fidelity. Applied to various video generation models, our method generates videos several times longer than the native window length while outperforming both training-free and autoregressive baselines in temporal consistency and visual quality, and further extends to audio-video joint generation and text-to-3DGS without any fine-tuning.
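The abstract's pipeline (overlapping windows, blending of predicted clean samples in the overlaps, stochastic re-noising early, deterministic ODE steps late) can be illustrated with a minimal NumPy sketch. It is not the paper's implementation, and several things here are assumptions: the interpolation convention `x_t = (1 - t) * x0 + t * eps`, the plain uniform averaging used as a stand-in for the Tweedie matching step, the switch time `t_switch`, and every function name (`window_starts`, `blend_clean_predictions`, `toy_denoiser`, `sample_long`). The denoiser is a toy placeholder with no learned content.

```python
import numpy as np

def window_starts(total_len, window, stride):
    """Start indices of overlapping windows that exactly tile total_len frames."""
    if total_len < window or (total_len - window) % stride != 0:
        raise ValueError("total_len must equal window + k * stride")
    return list(range(0, total_len - window + 1, stride))

def blend_clean_predictions(x0_windows, starts, total_len):
    """Uniformly average per-window clean predictions wherever windows overlap.

    Assumption: a simple stand-in for the paper's Tweedie matching step.
    """
    window = x0_windows[0].shape[0]
    acc = np.zeros((total_len,) + x0_windows[0].shape[1:])
    count = np.zeros((total_len,) + (1,) * (x0_windows[0].ndim - 1))
    for x0, s in zip(x0_windows, starts):
        acc[s:s + window] += x0
        count[s:s + window] += 1.0
    return acc / count

def toy_denoiser(x_t, t):
    """Hypothetical placeholder for a video model's clean-sample prediction."""
    return (1.0 - t) * x_t  # not a real model; only keeps the sketch runnable

def sample_long(total_len, window, stride, dim, num_steps=20, t_switch=0.6, seed=0):
    rng = np.random.default_rng(seed)
    starts = window_starts(total_len, window, stride)
    # One latent trajectory per window, each started from independent noise.
    latents = [rng.standard_normal((window, dim)) for _ in starts]
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        x0_windows = [toy_denoiser(x, t) for x in latents]
        x0_global = blend_clean_predictions(x0_windows, starts, total_len)
        new_latents = []
        for x, s in zip(latents, starts):
            x0 = x0_global[s:s + window]  # blended clean estimate for this window
            if t > t_switch:
                # High-noise phase: inject fresh noise after the correction.
                eps = rng.standard_normal(x.shape)
            else:
                # Low-noise phase: deterministic step reusing the implied noise.
                eps = (x - (1.0 - t) * x0) / t
            new_latents.append((1.0 - t_next) * x0 + t_next * eps)
        latents = new_latents
    return blend_clean_predictions(latents, starts, total_len)

video = sample_long(total_len=10, window=4, stride=2, dim=3)
print(video.shape)
```

With `window=4` and `stride=2`, every interior frame is covered by two windows, so the blended clean estimate is what ties neighbouring trajectories together. Under the assumed convention, the deterministic branch is algebraically the same as an Euler step of the probability-flow ODE.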
