# Sparse Autoencoders Interpret and Steer Text-to-Speech Language Models

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-09 02:09
- AIHOT score: 67
- AIHOT link: https://aihot.virxact.com/items/cmq7u3iet01pdslepkiqm7y1o
- Original link: https://arxiv.org/abs/2606.10029

## AI Summary

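The study trains BatchTopK sparse autoencoders (SAEs) on the language-model backbone of CosyVoice3 and introduces a modality-aware auto-interpretation pipeline that labels each feature by where it fires: text-prefix context, 1-second speech clips, or both. The recovered features are highly interpretable and span phonemes, laughter, accent prompts, and speaker gender. Steering through the SAE latent space shows that these features are causal: targeted interventions raise laughter probability from 0.02 to 0.79, flip perceived speaker gender, and control speech rate while preserving spoken content. SAE features can therefore serve both as interpretability objects and as control directions for TTS synthesis.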

## Main Text

Language models increasingly serve as the backbone of text-to-speech (TTS) systems, yet we understand little about the representations they build when text and generated speech tokens share a single residual stream. We train BatchTopK sparse autoencoders on the LM backbone of CosyVoice3 and introduce a modality-aware auto-interp pipeline that labels each feature from where it fires: text-prefix context, 1-second speech clips, or both. The recovered features are interpretable, spanning phonemes, laughter, accent prompts, and speaker gender. Steering through the SAE latent space shows these features are causal rather than merely descriptive: targeted interventions raise laughter probability from 0.02 to 0.79, flip perceived speaker gender, and control speech rate while preserving spoken content. SAE features thus serve both as interpretability objects and as control directions for TTS synthesis.
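
To make the two mechanisms named in the abstract concrete, here is a minimal pure-Python sketch of a BatchTopK activation and of steering along one SAE feature. It is an illustration under assumptions, not the paper's implementation: the function names (`encode`, `batch_topk`, `decode`, `steer`), the weight layout, and the steering rule (adding a scaled decoder direction to the residual-stream vector) are all hypothetical choices, since the abstract does not specify how the interventions are applied.

```python
def encode(x, W_enc, b_enc):
    """ReLU pre-activations of one residual-stream vector x.

    W_enc has one row per SAE feature (assumed layout).
    """
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W_enc, b_enc)]


def batch_topk(acts, k):
    """Keep the k * batch_size largest positive activations across the
    whole batch and zero the rest. Unlike per-sample top-k, one sample
    may keep more than k features and another fewer."""
    budget = k * len(acts)
    ranked = sorted(
        ((v, i, j) for i, row in enumerate(acts)
         for j, v in enumerate(row) if v > 0.0),
        reverse=True,
    )
    sparse = [[0.0] * len(row) for row in acts]
    for v, i, j in ranked[:budget]:
        sparse[i][j] = v
    return sparse


def decode(z, W_dec, b_dec):
    """Reconstruct an activation; W_dec[j] is feature j's decoder direction."""
    return [b_dec[t] + sum(z[j] * W_dec[j][t] for j in range(len(z)))
            for t in range(len(b_dec))]


def steer(x, W_dec, feature, alpha):
    """Assumed steering rule: push x along one feature's decoder direction."""
    return [xi + alpha * di for xi, di in zip(x, W_dec[feature])]


# Toy batch of already-encoded activations: 2 samples, 3 features, k = 1.
acts = [[5.0, 4.0, 0.0],
        [1.0, 0.0, 0.0]]
sparse = batch_topk(acts, k=1)   # the budget of 2 is spent entirely on sample 0

# Toy steering with an identity decoder: raise feature 1 by alpha = 3.
W_dec = [[1.0, 0.0], [0.0, 1.0]]
steered = steer([1.0, 2.0], W_dec, feature=1, alpha=3.0)
```

In the toy batch both surviving activations belong to the first sample, which shows the batch-level budget that distinguishes BatchTopK from per-sample top-k. In a real setup the steered vector would be written back into the residual stream at the hooked layer before decoding continues; the layer and the scale `alpha` would need to be chosen empirically.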
