# Interleaved Speech Language Models Implicitly Work in Text

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-21 08:00
- AIHOT score: 55
- AIHOT link: https://aihot.virxact.com/items/cmr0bp6x700dgslolxvwdirgf
- Original link: https://arxiv.org/abs/2606.22473

## AI Summary

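A logit lens analysis of interleaved speech language models from different families and sizes shows that the models implicitly transcribe speech into text tokens in their intermediate layers: for as much as 77% of the data, the text word corresponding to the spoken word appears among the top candidate words. The models then predict the next word in text space before converting back to the speech domain. This behavior does not come from speech recognition training; interleaved data and initialization from a text LM are the key factors that elicit the mechanism.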

## Main Text

Speech language models (SLMs) have been extensively studied, and the common paradigm incorporates text data and pre-trained text LMs. A leading approach is speech-text interleaving, in which models are trained on sequences containing both speech and text tokens, with the aim of boosting even speech-only capabilities. Yet how these two modalities interact in the model's latent space remains unclear. In this work, we analyze interleaved speech-text LMs from different model families and sizes using the logit lens to provide such insight. We reveal that these models go through an implicit transcription phase in which the text token of the spoken word becomes decodable in intermediate layers, despite the models not being trained for speech recognition. The transcription of the word appears as one of the top candidate words for as much as 77% of the data. Following this stage, the models predict the next word in the text space before transforming back to the speech domain. Finally, we analyze the role of interleaved data and of initialization from text LMs in eliciting this behavior, and examine how it correlates with spoken knowledge abilities. Our analysis sheds light on the internal mechanisms underlying the relationship between speech and text modalities and could shape SLM optimization.
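
To make the logit lens idea concrete, here is a minimal sketch in plain Python, not the authors' implementation. It assumes each layer's hidden state is available as a list of floats and that the model's unembedding matrix has one row per vocabulary token. The function names, the toy matrix, and the omission of the final layer normalization that real models usually apply before unembedding are all simplifying assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def logit_lens_topk(hidden: Vector, unembed: Sequence[Vector], k: int) -> List[int]:
    """Project one hidden state onto the vocabulary and return the top-k token ids."""
    # One logit per vocabulary row: dot product of the row with the hidden state.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembed]
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return ranked[:k]


def first_decodable_layer(
    layer_states: Sequence[Vector],
    unembed: Sequence[Vector],
    target_id: int,
    k: int,
) -> Optional[int]:
    """Return the first layer whose top-k contains target_id, or None if no layer does."""
    for layer, hidden in enumerate(layer_states):
        if target_id in logit_lens_topk(hidden, unembed, k):
            return layer
    return None


def topk_hit_rate(
    examples: Sequence[Tuple[Sequence[Vector], int]],
    unembed: Sequence[Vector],
    k: int,
) -> float:
    """Fraction of examples whose target token reaches the top-k at some layer."""
    hits = sum(
        first_decodable_layer(states, unembed, target, k) is not None
        for states, target in examples
    )
    return hits / len(examples)


# Toy setup (hypothetical values): 3-token vocabulary, 2-dimensional hidden states.
UNEMBED = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
# Hidden state of one position at three successive layers.
STATES = [[-1.0, -1.0], [0.2, 1.0], [1.0, 0.1]]

if __name__ == "__main__":
    # Layer at which token 1 first becomes the top candidate.
    print(first_decodable_layer(STATES, UNEMBED, target_id=1, k=1))
```

In this toy setup, token 1 stands in for the text token of the spoken word: it is not the top candidate at layer 0 but becomes so at layer 1, which mirrors in miniature the kind of intermediate-layer decodability the abstract describes. A statistic of the form "the transcription appears among the top candidates for some fraction of the data" corresponds to `topk_hit_rate` computed over many positions, although how the paper defines its candidate set is not specified here.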
