# SwiftVR: Real-Time One-Step Generative Video Restoration

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-08 08:00
- AIHOT score: 69
- AIHOT link: https://aihot.virxact.com/items/cmq6ci8g4075rsl5ieuxtopbq
- Original paper: https://arxiv.org/abs/2606.09516

## AI Summary

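SwiftVR is a streaming one-step generative video restoration framework. It uses mask-free shifted-window self-attention and a lightweight Restoration-aware Autoencoder to reduce the latency and memory bottlenecks caused by quadratic spatial attention and large video autoencoders. Because the model relies only on standard dense SDPA calls, it can be deployed on consumer-grade GPUs without retraining or custom kernels. On a single H100 it reaches 31 FPS at 2560×1440 and 14 FPS at 3840×2160, whereas the compared diffusion-based VR baselines run out of memory at 4K. On an RTX 5090 it reaches 26 FPS at 1080p, which the authors describe as the first generative video restoration model to achieve real-time 1080p streaming on a consumer-grade GPU.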
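As a rough sanity check on the figures above, the sketch below converts each reported frame rate into an average time budget per frame and a pixel throughput. The helper names `frame_budget_ms` and `pixel_throughput` are illustrative and do not come from the paper. Under a chunk-wise streaming protocol, the average time per frame is not necessarily the end-to-end latency of any single frame.

```python
def frame_budget_ms(fps):
    """Average wall-clock time available per frame, in milliseconds."""
    return 1000.0 / fps


def pixel_throughput(width, height, fps):
    """Output pixels produced per second at the given resolution and rate."""
    return width * height * fps


# (GPU, width, height, FPS) as reported in the summary above.
reported = [
    ("H100", 2560, 1440, 31),
    ("H100", 3840, 2160, 14),
    ("RTX 5090", 1920, 1080, 26),
]

for gpu, w, h, fps in reported:
    budget = frame_budget_ms(fps)
    mpx_per_s = pixel_throughput(w, h, fps) / 1e6
    print(f"{gpu} {w}x{h}: {budget:.1f} ms/frame, {mpx_per_s:.1f} Mpx/s")
# H100 2560x1440: 32.3 ms/frame, 114.3 Mpx/s
# H100 3840x2160: 71.4 ms/frame, 116.1 Mpx/s
# RTX 5090 1920x1080: 38.5 ms/frame, 53.9 Mpx/s
```

The two H100 throughput figures come out close to each other, which is what one would expect if cost grows roughly linearly with pixel count, although the source does not state this directly.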

## Main Text

Real-time video restoration (VR) for live streams requires high-resolution outputs under strict per-frame latency constraints. Existing one-step diffusion-based VR models remain difficult to deploy on consumer-grade GPUs due to two main bottlenecks: quadratic spatial attention at high resolutions and the latency-memory overhead of large video autoencoders. We present SwiftVR, a streaming one-step generative VR framework that reduces both bottlenecks under a causal chunk-wise protocol. For attention, mask-free shifted-window self-attention gathers each spatial window into a dense tensor via deterministic indexing, keeping all attention calls on the dense scaled dot-product attention (SDPA) path without masks, cyclic shifts, padding, or hardware-specific sparse kernels. Because SwiftVR uses only standard dense SDPA calls, the trained model transfers to consumer GPUs without retraining or custom kernels. For autoencoding, a lightweight Restoration-aware Autoencoder enables fast chunk-wise decoding while preserving reconstruction quality. On a single H100, SwiftVR sustains 31 FPS at 2560x1440 and 14 FPS at 3840x2160, whereas all compared diffusion-based VR baselines exceed the memory limit at 4K. On a consumer RTX 5090, SwiftVR reaches 26 FPS at 1920x1080. To our knowledge, SwiftVR is the first generative VR model to achieve real-time 1080p streaming on a consumer-grade GPU, while attaining strong no-reference perceptual quality with lower inference cost. The project page is available at https://h-oliday.github.io/SwiftVR.
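The abstract does not spell out how windows are gathered, so the following is only a minimal sketch of one way deterministic indexing could avoid masks, cyclic shifts, and padding. It assumes that a shifted layout simply moves the window boundaries, leaving smaller windows at the frame edges, and that windows of equal shape are grouped so each group can be stacked into one dense tensor. The function names `axis_segments` and `window_index_groups` are hypothetical, and this should not be read as the authors' implementation.

```python
from collections import defaultdict


def axis_segments(n, window, shift=0):
    """Split range(n) into contiguous segments whose boundaries are offset by `shift`.

    Assumption: edge segments are allowed to be shorter than `window`
    instead of being padded or wrapped around.
    """
    start = shift % window
    cuts = sorted({0, n, *range(start, n, window)})
    return list(zip(cuts, cuts[1:]))


def window_index_groups(height, width, window, shift=0):
    """Map each window shape to the flat token indices of every window with that shape.

    Each value is a list of equally sized index lists, so gathering tokens
    with them gives a dense (num_windows, tokens_per_window) batch per shape.
    """
    groups = defaultdict(list)
    for r0, r1 in axis_segments(height, window, shift):
        for c0, c1 in axis_segments(width, window, shift):
            idx = [r * width + c for r in range(r0, r1) for c in range(c0, c1)]
            groups[(r1 - r0, c1 - c0)].append(idx)
    return dict(groups)


# An 8x8 token grid with 4x4 windows shifted by 2 along both axes.
groups = window_index_groups(8, 8, window=4, shift=2)
for shape, windows in sorted(groups.items()):
    print(shape, len(windows))
# (2, 2) 4
# (2, 4) 2
# (4, 2) 2
# (4, 4) 1
```

Every token lands in exactly one window, and each shape group could then be passed to an ordinary dense attention call (for example PyTorch's `torch.nn.functional.scaled_dot_product_attention`) with no attention mask. Under this scheme a window of at most `window * window` tokens bounds the per-token attention cost, so the total cost grows linearly with the number of tokens rather than quadratically.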
