# WavFlow: Audio Generation in Waveform Space

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-18 08:00
- AIHOT score: 67
- AIHOT link: https://aihot.virxact.com/items/cmpd29t5j00b9slk1v4knaigx
- Original paper: https://arxiv.org/abs/2605.18749

## AI Summary

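WavFlow challenges the paradigm that audio generation must rely on latent-space compression, proposing a framework that generates high-fidelity audio directly in raw waveform space. To address the difficulty of modeling high-dimensional signals, the method reshapes audio into a 2D token grid and introduces amplitude lifting, which, combined with direct prediction in flow matching, yields stable optimization. An automated pipeline builds a dataset of 5 million high-quality triplets, from which the model learns fine-grained acoustic features from scratch. Experiments show that WavFlow matches or exceeds mainstream latent-space methods on the video-to-audio (VGGSound) and text-to-audio (AudioCaps) benchmarks, demonstrating that intermediate compression is not necessary and offering a simpler, more scalable path for multimodal audio generation.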
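The reshaping and rescaling steps mentioned above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the patch size, the grid width, and in particular the reading of "amplitude lifting" as a simple gain that brings the waveform to roughly unit standard deviation are all assumptions made here for clarity.

```python
import numpy as np


def patchify(wave, patch_size, grid_width):
    """Reshape a 1D waveform into a (rows, grid_width, patch_size) token grid.

    Hypothetical helper: each token is a run of `patch_size` consecutive
    samples, and tokens are laid out row by row. The tail is zero-padded.
    """
    n_tokens = -(-len(wave) // patch_size)   # ceiling division
    n_rows = -(-n_tokens // grid_width)      # ceiling division
    padded = np.zeros(n_rows * grid_width * patch_size, dtype=wave.dtype)
    padded[: len(wave)] = wave
    return padded.reshape(n_rows, grid_width, patch_size)


def unpatchify(grid, length):
    """Invert `patchify` by flattening the grid and dropping the padding."""
    return grid.reshape(-1)[:length]


def lift_amplitude(wave, target_std=1.0, eps=1e-8):
    """Assumed form of amplitude lifting: scale to a target standard deviation.

    Returns the scaled waveform and the gain, so the gain can be undone
    after generation.
    """
    scale = target_std / (wave.std() + eps)
    return wave * scale, scale


# A quiet synthetic signal standing in for low-energy raw audio.
wave = (0.01 * np.sin(np.linspace(0.0, 100.0, 1000))).astype(np.float32)
lifted, scale = lift_amplitude(wave)
grid = patchify(lifted, patch_size=16, grid_width=8)
restored = unpatchify(grid, len(wave)) / scale
```

Because the gain is returned alongside the lifted signal, the original amplitude can be recovered after generation by dividing by `scale`, as the last line does.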

## Main Text

Modern audio generation predominantly relies on latent-space compression, introducing additional complexity and potential information loss. In this work, we challenge this paradigm with WavFlow, a framework that generates high-fidelity audio directly in raw waveform space without intermediate representations. To overcome the inherent difficulties of modeling high-dimensional and low-energy signals, we reshape audio into 2D token grids through waveform patchify and introduce amplitude lifting to align signal scales, enabling stable optimization via direct x-prediction in flow matching. To capture complex semantic alignment and temporal synchronization, we leverage an automated data pipeline to curate 5 million high-quality video-text-audio triplets, allowing the model to learn fine-grained acoustic patterns from scratch. Experimental results show that WavFlow achieves competitive performance on the video-to-audio benchmark VGGSound (FD_PaSST: 59.98, IS_PANNs: 17.40, DeSync: 0.44) and the text-to-audio benchmark AudioCaps (FD_PANNs: 10.63, IS_PANNs: 12.62), matching or exceeding the performance of established latent-based methods. Our work demonstrates that intermediate compression is not a prerequisite for high-quality synthesis, offering a simpler and more scalable alternative for multimodal audio generation.
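To make "direct x-prediction in flow matching" concrete, here is a minimal NumPy sketch of one common formulation. It assumes a linear path from noise at t = 0 to data at t = 1 and a model that outputs the clean signal rather than the velocity; the abstract does not state the paper's exact path, time convention, or loss weighting, so every name and choice below is hypothetical.

```python
import numpy as np


def interpolate(x_clean, noise, t):
    """Linear path: pure noise at t = 0, clean data at t = 1 (assumed convention)."""
    return (1.0 - t) * noise + t * x_clean


def x_prediction_loss(model, x_clean, rng):
    """Mean squared error between the model's clean-signal estimate and the target.

    `model(x_t, t)` is a hypothetical callable that returns an estimate of
    x_clean with the same shape as x_t.
    """
    batch = x_clean.shape[0]
    # One time value per example, broadcastable over the remaining axes.
    t = rng.uniform(0.0, 1.0, size=(batch,) + (1,) * (x_clean.ndim - 1))
    noise = rng.standard_normal(x_clean.shape)
    x_t = interpolate(x_clean, noise, t)
    x_pred = model(x_t, t)
    return float(np.mean((x_pred - x_clean) ** 2))


def velocity_from_x_pred(x_pred, x_t, t, eps=1e-3):
    """Recover the velocity (x_clean - noise) implied by an x-prediction.

    On the linear path, x_clean - x_t = (1 - t) * (x_clean - noise), so
    dividing by (1 - t) gives the velocity; `eps` guards against t near 1.
    """
    return (x_pred - x_t) / np.maximum(1.0 - t, eps)


rng = np.random.default_rng(0)
x_clean = rng.standard_normal((4, 8, 16))   # e.g. a batch of token grids


def oracle(x_t, t):
    """Stand-in model that always returns the true clean signal."""
    return x_clean


loss = x_prediction_loss(oracle, x_clean, rng)
```

At sampling time the predicted clean signal can be converted to a velocity with `velocity_from_x_pred` and integrated with any ODE solver. One plausible reading of the abstract is that the regression target then lives on the same scale as the (lifted) audio, which may be part of why signal scale matters for stable training.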
