# WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-04-16 08:00
- AIHOT link: https://aihot.virxact.com/items/cmoawtkqs04nasl1ywx7ky4xz
- Original link: https://arxiv.org/abs/2604.14932

## AI Summary

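The research team proposes WavAlign, a modality-aware adaptive post-training recipe that addresses the limited intelligence and expressiveness of end-to-end spoken dialogue models. The method constrains preference updates to the semantic channel, improves acoustic behavior through explicit anchoring, and dynamically adjusts the mixture ratio based on rollout statistics to avoid unreliable gradients. Evaluations across multiple spoken dialogue benchmarks and representative architectures show consistent gains in both semantic quality and speech expressiveness.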

## Main Text

End-to-end spoken dialogue models have garnered significant attention because they offer a higher potential ceiling in expressiveness and perceptual ability than cascaded systems. However, the intelligence and expressiveness of current open-source spoken dialogue models often remain below expectations. Motivated by the success of online reinforcement learning (RL) in other domains, one might attempt to directly apply preference optimization to spoken dialogue models, yet this transfer is non-trivial. We analyze these obstacles from the perspectives of reward modeling and rollout sampling, focusing on how sparse preference supervision interacts with dense speech generation under shared-parameter updates. Based on the analysis, we propose a modality-aware adaptive post-training recipe that makes RL practical for spoken dialogue: it constrains preference updates to the semantic channel and improves acoustic behavior via explicit anchoring, while dynamically regulating their mixture from rollout statistics to avoid unreliable preference gradients. We evaluate the method across multiple spoken dialogue benchmarks and representative architectures, and observe consistent improvements in semantic quality and speech expressiveness.
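
The abstract does not spell out the objective, so the following is only a minimal sketch of one way such a recipe could be wired together, not the authors' formulation. It assumes a per-token preference loss, a per-token anchoring loss (for example a supervised likelihood term), a mask marking semantic tokens, and a gate that shrinks the preference weight when rollout rewards barely differ. The names `preference_weight`, `mixed_loss`, and `std_scale` are hypothetical.

```python
from statistics import pstdev


def preference_weight(rewards, std_scale=1.0):
    """Hypothetical gate: map the spread of rollout rewards to a weight in [0, 1].

    Assumption: when all rollouts for a prompt score about the same, the
    preference signal is treated as unreliable and the weight goes to 0.
    """
    if std_scale <= 0:
        raise ValueError("std_scale must be positive")
    if len(rewards) < 2:
        return 0.0
    spread = pstdev(rewards)  # population standard deviation of the rewards
    return min(1.0, spread / std_scale)


def mixed_loss(pref_losses, anchor_losses, is_semantic, weight):
    """Blend a preference term and an anchoring term (illustrative only).

    The preference term averages over semantic tokens only; the anchoring
    term averages over the remaining (acoustic) tokens. This split is an
    assumption made for the sketch.
    """
    semantic = [p for p, s in zip(pref_losses, is_semantic) if s]
    acoustic = [a for a, s in zip(anchor_losses, is_semantic) if not s]
    pref_term = sum(semantic) / len(semantic) if semantic else 0.0
    anchor_term = sum(acoustic) / len(acoustic) if acoustic else 0.0
    return weight * pref_term + (1.0 - weight) * anchor_term


# Usage: four rollouts for one prompt, then a three-token sequence
# whose last token is acoustic.
w = preference_weight([0.2, 0.9, 0.4, 0.7], std_scale=0.5)
loss = mixed_loss([2.0, 4.0, 100.0], [50.0, 50.0, 6.0], [True, True, False], w)
```

In this sketch the preference loss on the acoustic token (the `100.0` entry) never reaches the total, which mirrors the idea of confining preference updates to the semantic channel. Identical rewards across rollouts give a weight of 0, so the update falls back to anchoring alone.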
