# On-Policy Adversarial Flow Distillation for Autoregressive Video Generation

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-25 08:00
- AIHOT score: 64
- AIHOT link: https://aihot.virxact.com/items/cmpm4ktyn0lihsl01mkcq13ik
- Original link: https://arxiv.org/abs/2605.26105

## AI Summary

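The paper proposes Adversarial Flow Distillation (AFD), a framework for distilling a black-box video teacher model into a causal autoregressive student model. The method generates teacher and student outputs in parallel on the same prompts, trains a Bradley-Terry discriminator to estimate the teacher-student discrepancy on clean samples, and converts this on-policy advantage into forward-process flow-matching updates on the student's own noised states. It requires no teacher scores, latents, denoising trajectories, or step alignment. Experiments show that AFD consistently improves motion- and physics-sensitive generation quality across two autoregressive student families while preserving overall quality, offering a practical route for distilling proprietary or heterogeneous video generators.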

## Main Text

Autoregressive video generators are attractive for streaming, long-horizon, and interactive applications, but distilling strong black-box teachers into causal students remains difficult. The student must learn under its own rollout distribution, whereas practical teachers may expose only prompt-conditioned completed videos and may differ in architecture, capacity, temporal design, and sampling schedule. This interface makes supervised fine-tuning off-policy, score-based distillation inapplicable, and direct adversarial imitation too sparse for denoising-time credit assignment. We propose Adversarial Flow Distillation (AFD), an on-policy framework for heterogeneous black-box video distillation. AFD queries the teacher and rolls out the current student on the same prompts, trains a prompt-paired Bradley-Terry discriminator to estimate clean-sample teacher-student discrepancy, and converts the resulting on-policy advantage into forward-process flow-matching updates on the student's own noised states. Thus, AFD provides dense velocity-field supervision while requiring no teacher scores, latents, denoising trajectories, step alignment, or reverse-chain reinforcement learning. Experiments across two causal AR student families show that AFD consistently improves motion- and physics-sensitive generation while preserving general video quality, and ablations validate the importance of adaptive on-policy feedback and forward-process credit assignment. The method requires only clean teacher videos and student rollouts, providing a practical route for distilling proprietary or heterogeneous video generators into efficient autoregressive students.
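
The two ingredients named in the abstract, a prompt-paired Bradley-Terry discriminator and an advantage-driven flow-matching update on the student's own noised states, can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract gives no formulas, so the batch-centered advantage, the linear interpolation `x_t = (1 - t) * x0 + t * eps`, the advantage-weighted squared-error loss, and every function name below are assumptions chosen for illustration. Samples are plain lists of floats standing in for video latents.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bt_discriminator_loss(teacher_scores, student_scores):
    """Prompt-paired Bradley-Terry loss.

    Each pair holds discriminator scores for a teacher video and a student
    video generated from the same prompt. The loss is the mean of
    -log sigmoid(score_teacher - score_student).
    """
    losses = [softplus(-(t - s)) for t, s in zip(teacher_scores, student_scores)]
    return sum(losses) / len(losses)


def student_advantages(student_scores):
    """Assumed advantage: each student score minus the batch mean (a baseline)."""
    mean = sum(student_scores) / len(student_scores)
    return [s - mean for s in student_scores]


def noised_state(x0, eps, t):
    """Forward process under an assumed linear path: x_t = (1 - t) * x0 + t * eps."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, eps)]


def weighted_flow_matching_loss(v_pred, x0, eps, advantage):
    """Advantage-weighted flow-matching loss for one student rollout (assumed form).

    Under the linear path above, the target velocity is eps - x0. A positive
    advantage pulls the predicted velocity toward the student's own sample;
    a negative advantage pushes it away.
    """
    target = [b - a for a, b in zip(x0, eps)]
    mse = sum((p - q) ** 2 for p, q in zip(v_pred, target)) / len(target)
    return advantage * mse
```

In a training loop built on this sketch, the discriminator would be updated with `bt_discriminator_loss` on fresh teacher and student videos for the same prompts, and the student would then be updated with `weighted_flow_matching_loss` evaluated at `noised_state(x0, eps, t)` for its own rollouts `x0`. Because the second loss only needs clean student samples and a scalar advantage, nothing from the teacher's denoising process is required, which matches the black-box setting the abstract describes.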