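MoVE: Restoring Non-Verbal Emotions Such as Crying and Laughter in Speech-to-Speech Translation with a Mixture of Vocalization Experts Architecture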

2026-04-19 08:00 · 75 days ago

AI Summary

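Existing speech-to-speech translation systems often strip out non-verbal vocalizations such as laughter and crying, which severely limits their practical usefulness. The research team proposes the MoVE architecture, which uses a Mixture-of-LoRA-Experts design and a soft-weighting router to capture mixed emotional states, and which can be trained with only 30 minutes of curated data. On English-Chinese translation, MoVE reproduces the target non-verbal vocalizations in 76% of cases, well above the at most 14% preserved by existing systems, and it receives the highest human ratings for naturalness and emotional fidelity.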

Original text · untranslated

Recent Speech-to-Speech Translation (S2ST) systems achieve strong semantic accuracy yet consistently strip away non-verbal vocalizations (NVs), such as laughter and crying, that convey pragmatic intent, which severely limits real-world utility. We address this via three contributions. First, we propose a synthesis pipeline for building scalable expressive datasets to overcome data scarcity. Second, we propose MoVE, a Mixture-of-LoRA-Experts architecture with expressive-specialized adapters and a soft-weighting router that blends experts to capture hybrid expressive states. Third, we show that pretrained AudioLLMs enable striking data efficiency: 30 minutes of curated data is enough for strong performance. On English-Chinese S2ST, compared against strong baselines, MoVE reproduces target NVs in 76% of cases and achieves the highest human-rated naturalness and emotional fidelity among all compared systems, whereas existing S2ST systems preserve at most 14% of NVs.
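The abstract gives no implementation details, so the following is only a minimal sketch of what a soft-weighted mixture of LoRA experts could look like, not the authors' code. It assumes a single frozen linear layer, a softmax router over the input, and one low-rank adapter per expert. The class name `MixtureOfLoRAExperts`, the helper functions, the dimensions, the initialization, and the `alpha / rank` scaling are all hypothetical choices made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(x, M):
    """Row vector x (length n) times matrix M (n x m), both as plain lists."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]


class MixtureOfLoRAExperts:
    """Hypothetical layer: frozen base weight plus softly blended LoRA experts."""

    def __init__(self, d_in, d_out, num_experts=3, rank=2, alpha=4.0, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

        self.W = rand(d_in, d_out)                        # frozen base weight
        self.A = [rand(d_in, rank) for _ in range(num_experts)]  # down-projections
        # Up-projections start at zero, so the adapters initially change nothing.
        self.B = [[[0.0] * d_out for _ in range(rank)] for _ in range(num_experts)]
        self.router = rand(d_in, num_experts)             # assumed linear router
        self.scale = alpha / rank                         # common LoRA scaling

    def gate(self, x):
        """Soft weights over experts: every expert gets a nonzero share."""
        return softmax(matvec(x, self.router))

    def forward(self, x):
        out = matvec(x, self.W)
        for g, A, B in zip(self.gate(x), self.A, self.B):
            delta = matvec(matvec(x, A), B)               # low-rank update (x A) B
            out = [o + self.scale * g * d for o, d in zip(out, delta)]
        return out
```

Because the gate is a softmax rather than a hard top-1 choice, an input can draw on several experts at once (for example, a hypothetical laughter expert and a crying expert), which is one plausible reading of how blending experts could represent hybrid expressive states.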

HuggingFace Daily Papers (community trending papers)

Read the original · arxiv.org