Speaker Identity in Non-Verbal Vocalizations: A Conditional Distillation and Mixture-of-Experts Approach
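Source: arxiv.org
Existing speaker verification (SV) systems generalize poorly when assessing speaker-identity consistency in non-verbal vocalizations (NVVs), and fine-tuning them causes catastrophic forgetting. This paper proposes a framework that combines frozen Data2Vec self-supervised features with ECAPA-TDNN and adds a Mixture of Experts (MoE) module with domain-aware routing. A conditional distillation loss, applied to speech inputs through a pretrained teacher model, preserves speech verification accuracy, while a contrastive loss bridges the domain gap between speech and NVVs. The method reduces the speech-NVV equal error rate (EER) from 38.93% to 22.66% and the speech EER from 13.17% to 9.24%.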
As expressive text-to-speech (TTS) and voice conversion (VC) systems increasingly generate non-verbal vocalizations (NVVs) to enhance naturalness, reliable speaker verification (SV) becomes essential to objectively assess identity consistency across both verbal and non-verbal segments. Yet current SV systems generalize poorly to NVVs, and fine-tuning on NVV data causes catastrophic forgetting of speech performance. We present the first systematic study across 10 NVV types and propose a framework combining frozen Data2Vec self-supervised features with ECAPA-TDNN, enhanced by a Mixture of Experts (MoE) module with learned domain-aware routing. A conditional distillation loss on speech inputs via a pretrained teacher retains speech-to-speech accuracy, while a contrastive loss bridges the speech-NVV domain gap. Our method reduces speech-NVV EER from 38.93% to 22.66% over a pretrained baseline, and improves speech EER from 13.17% to 9.24% via distillation.
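The abstract names two training signals without giving their exact form: a distillation loss applied only to speech inputs, and a contrastive loss linking speech and NVV embeddings. The sketch below illustrates one plausible reading in plain Python. The cosine-distance distillation term, the InfoNCE-style contrastive term, the `temperature` value, the `lambda_distill` weight, and all function names are assumptions made for illustration and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def conditional_distillation_loss(student, teacher, is_speech):
    """Mean cosine distance between student and teacher embeddings.

    The loss is "conditional" in the sense assumed here: only items flagged
    as speech contribute, so NVV items are not pulled toward the teacher.
    """
    terms = [
        1.0 - cosine(s, t)
        for s, t, speech in zip(student, teacher, is_speech)
        if speech
    ]
    return sum(terms) / len(terms) if terms else 0.0


def contrastive_loss(speech_emb, nvv_emb, temperature=0.1):
    """InfoNCE-style loss (an assumed form, not the paper's stated one).

    speech_emb[i] and nvv_emb[i] are taken to come from the same speaker;
    every other NVV embedding in the batch acts as a negative.
    """
    total = 0.0
    for i, s in enumerate(speech_emb):
        logits = [cosine(s, n) / temperature for n in nvv_emb]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(speech_emb)


def total_loss(student, teacher, is_speech, speech_emb, nvv_emb,
               lambda_distill=1.0):
    """Weighted sum of both terms; lambda_distill is a hypothetical weight."""
    return (contrastive_loss(speech_emb, nvv_emb)
            + lambda_distill
            * conditional_distillation_loss(student, teacher, is_speech))
```

Under these assumptions, masking the distillation term lets the student move freely on NVV inputs while staying anchored to the teacher on speech, which matches the abstract's stated goal of retaining speech-to-speech accuracy.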