SANA-Streaming: Real-Time Streaming Video Editing with a Hybrid Diffusion Transformer

2026-05-28 08:00

AI Summary

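SANA-Streaming is a high-resolution, real-time streaming video-to-video editing framework designed for consumer GPUs. It rests on three core designs: a hybrid Diffusion Transformer architecture that combines softmax attention with the efficiency of linear layers; Cycle-Reverse Regularization, a training strategy that improves temporal consistency by predicting source frames from generated content; and an efficient system co-design that pairs fused GDN kernels optimized for NVIDIA Blackwell (RTX 5090) with Mixed-Precision Quantization (MPQ). On a single RTX 5090, the system edits 1280x704 video in real time at 24 FPS end to end, with the DiT core reaching 58 FPS. Experiments show that it significantly outperforms existing SOTA methods in temporal coherence and system throughput.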

Original abstract

Real-time streaming video-to-video editing (V2V) is critical for interactive applications such as live broadcasting and gaming, yet it remains a formidable challenge due to the stringent requirements for temporal consistency and inference throughput. In this paper, we present SANA-Streaming, a system-algorithm co-designed framework for high-resolution, real-time streaming video editing on consumer GPUs, with the following three core designs: (1) Hybrid Diffusion Transformer architecture introduces softmax attention in part of the blocks to improve local modeling capabilities while preserving the efficiency of linear layers. (2) Cycle-Reverse Regularization is a novel training strategy that enforces semantic consistency by predicting source frames from generated content via flow matching, improving temporal consistency without requiring paired long edited videos. (3) Efficient System Co-design combines fused GDN kernels and Mixed-Precision Quantization (MPQ) optimized for the NVIDIA Blackwell (RTX 5090) architecture. By profiling real-world throughput, our MPQ maximizes Tensor Core utilization while maintaining generation quality. The resulting system achieves real-time 1280 x 704 resolution editing at 24 end-to-end FPS on a single RTX 5090 GPU, with the DiT core running at 58 FPS. Experimental results demonstrate that our co-design approach significantly outperforms existing SOTA methods in both temporal coherence and system throughput.
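To put the throughput figures in perspective, the short calculation below converts the quoted frame rates into per-frame time budgets. It is only a back-of-the-envelope sketch: the abstract does not break down the end-to-end pipeline, so the "remaining" budget assumes, purely for illustration, that the DiT core and all other stages run strictly one after another for each frame. If the real system overlaps stages, this split does not apply.

```python
# Figures quoted in the abstract.
END_TO_END_FPS = 24      # full pipeline, single RTX 5090
DIT_FPS = 58             # DiT core alone
WIDTH, HEIGHT = 1280, 704


def frame_time_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


total_ms = frame_time_ms(END_TO_END_FPS)
dit_ms = frame_time_ms(DIT_FPS)

# ASSUMPTION (not stated in the abstract): stages run sequentially per frame,
# so everything outside the DiT core shares whatever time is left over.
other_ms = total_ms - dit_ms

pixels_per_frame = WIDTH * HEIGHT
pixels_per_second = pixels_per_frame * END_TO_END_FPS

print(f"end-to-end budget per frame: {total_ms:.1f} ms")
print(f"DiT core per frame:          {dit_ms:.1f} ms")
print(f"left for other stages:       {other_ms:.1f} ms (sequential assumption)")
print(f"output pixels per second:    {pixels_per_second:,}")
```

Under that sequential assumption, the DiT core would account for roughly two fifths of each frame's time budget at 24 FPS.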

HuggingFace Daily Papers (community trending papers)


Read the original: arxiv.org
