# Streaming Synchronized Spatial Audio Generation with an Autoregressive Diffusion Transformer

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-29 08:00
- AIHOT score: 59
- AIHOT link: https://aihot.virxact.com/items/cmpupuvkx007ysl3tuvuf64g3
- Original link: https://arxiv.org/abs/2605.30940

## AI Summary

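To address the quality-latency trade-off and the difficulty of capturing multimodal spatial information in real-time, high-quality spatial audio generation, this paper proposes SwanSphere, a unified streaming framework that generates high-fidelity spatial audio from panoramic videos and text prompts. Its core contributions are: 1) a causal autoregressive diffusion Transformer architecture that enables streaming, high-quality generation; 2) a spatial video-audio contrastive learning strategy that aligns the video encoder with the acoustic domain, combined with multi-objective online direct preference optimization to strengthen spatial perception and the robustness of multimodal synthesis; 3) an automated annotation pipeline that generates detailed spatial captions to mitigate data scarcity. Experiments show that SwanSphere performs strongly on both video-to-spatial-audio and text-to-spatial-audio tasks.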

## Main Text

Real-time, accurate spatial audio generation is pivotal for delivering an immersive experience. However, existing spatial audio synthesis methods face a trade-off between generation quality and inference latency, and they struggle to capture precise spatial information from multimodal inputs. To address these challenges, we propose SwanSphere, a unified streaming framework for high-fidelity spatial audio generation from panoramic videos and text prompts. SwanSphere makes three main contributions: 1) We introduce a causal autoregressive diffusion transformer architecture that enables streaming, high-quality spatial audio generation. 2) We design a Spatial Video-Audio Contrastive (SVAC) learning strategy to align the video encoder with the acoustic domain, and further employ a multi-objective online direct preference optimization (ODPO) scheme, resulting in strong spatial perception and robust multimodal spatial audio synthesis. 3) To alleviate the current scarcity of spatial audio datasets, we develop an automated annotation pipeline that generates detailed spatial captions. Experimental results demonstrate that SwanSphere achieves superior performance in both video-to-spatial and text-to-spatial audio generation tasks. Demos can be found at: https://swanaigc.github.io.
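
The abstract does not say how causality is enforced inside the transformer. One common way to make chunk-by-chunk streaming generation possible is a block-causal attention mask: each audio frame may attend to frames in its own chunk and in earlier chunks, but never to later ones. The sketch below illustrates that general idea only. The function name `block_causal_mask` and the parameters `num_frames` and `chunk_size` are hypothetical and are not taken from SwanSphere.

```python
from typing import List


def block_causal_mask(num_frames: int, chunk_size: int) -> List[List[bool]]:
    """Return mask[i][j] == True when frame i may attend to frame j.

    Illustrative assumption, not SwanSphere's documented design: frames are
    grouped into consecutive chunks of `chunk_size`. A frame sees every frame
    in its own chunk and in all earlier chunks, and no frame in a later chunk.
    """
    if num_frames < 0 or chunk_size <= 0:
        raise ValueError("num_frames must be >= 0 and chunk_size must be > 0")
    mask = []
    for i in range(num_frames):
        query_chunk = i // chunk_size  # chunk index of the attending frame
        # Allow attention to any frame whose chunk is not in the future.
        mask.append([j // chunk_size <= query_chunk for j in range(num_frames)])
    return mask


if __name__ == "__main__":
    # Print the mask for 6 frames in chunks of 2 ("x" = allowed, "." = blocked).
    for row in block_causal_mask(num_frames=6, chunk_size=2):
        print("".join("x" if allowed else "." for allowed in row))
```

With a mask of this shape, a chunk that has already been generated never depends on chunks that come later, so finished chunks can be emitted while later ones are still being denoised. Whether SwanSphere uses chunk-level or frame-level causality is not stated in the abstract.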