Audio Interaction: open-source voice model listens continuously and decides every 0.4 seconds whether to speak or stay silent

2026-06-06 18:50 · 26 days ago · Jonathan Kemper

AI Summary

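Audio Interaction is an open-source voice model that listens to its surroundings continuously and decides every 0.4 seconds whether to speak or stay silent. Unlike GPT-4o or Qwen3.5-Omni, it does not have to wait for a recording to end, and it can translate, transcribe, hold a conversation, and recognize everyday noises such as coughing in a single stream. The code and model weights have been released on GitHub under the Apache 2.0 open-source license; the training data will follow later.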

Original text

New open-source voice model listens nonstop and decides every 0.4 seconds whether to speak or stay silent

Key Points

The "Audio Interaction" AI model processes continuous audio streams and combines tasks such as dialog, translation, transcription and sound recognition in a single system.

To do this, it breaks down the audio stream into 0.4-second segments and decides after each segment via a special token whether it should remain silent or generate a response.

Trained on a synthetic dataset of 302,000 hours of audio, the model processes listening and speaking in parallel. This minimizes the waiting time for responses and allows the system to beat models such as Gemini 3 Flash in proactive noise detection tests.

Researchers want to close the gap between today's audio speech models and real listeners. Their system handles dialog, translation, and sound recognition all at once.

Today's audio voice models, like GPT-4o or Qwen 3.5-Omni, work like a dictation machine with a button: they only respond when the recording ends. Streaming systems like Moshi for dialog or Paraformer for live subtitles do listen in, but they can only handle one task at a time and treat sounds like coughing as background noise.

Researchers from China, Hong Kong, and Singapore want to combine both approaches with "Audio Interaction." The model listens to an audio stream continuously, breaks it into 0.4-second chunks, and decides after each chunk whether to stay silent or speak. Translation, transcription, chatting, and reacting to everyday noises all run in a single three-billion-parameter model.
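To make the chunking concrete, here is a minimal Python sketch of cutting a sample buffer into 0.4-second pieces. The 16 kHz sample rate and the helper name `iter_chunks` are assumptions for illustration; the article does not say how the model's audio front end is actually implemented.

```python
from typing import Iterator, Sequence

CHUNK_SECONDS = 0.4      # chunk length stated in the article
SAMPLE_RATE_HZ = 16_000  # assumption: the article does not give a sample rate


def iter_chunks(samples: Sequence[float],
                sample_rate_hz: int = SAMPLE_RATE_HZ,
                chunk_seconds: float = CHUNK_SECONDS) -> Iterator[Sequence[float]]:
    """Yield consecutive fixed-length chunks; a trailing partial chunk is dropped."""
    chunk_len = round(sample_rate_hz * chunk_seconds)
    for start in range(0, len(samples) - chunk_len + 1, chunk_len):
        yield samples[start:start + chunk_len]


# Two seconds of silence at the assumed rate splits into five chunks.
two_seconds = [0.0] * (2 * SAMPLE_RATE_HZ)
chunks = list(iter_chunks(two_seconds))
print(len(chunks), len(chunks[0]))  # prints "5 6400"
```

Under these assumptions each decision point covers 6,400 samples, so the model would make 2.5 speak-or-stay-silent decisions per second of audio.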

One special token every 0.4 seconds

After each audio snippet, the model outputs one of two special tokens: one that means stay silent and one that means speak. If it picks the silent token, it keeps listening. Only with the speak token does it start talking. Classic tasks like "Translate into English" become instructions within the same continuous stream.
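The per-chunk decision can be sketched as a simple loop. Everything named here is hypothetical: `Decision`, `run_stream`, `energy_policy`, and `canned_reply` are illustrative stand-ins, and the loudness threshold replaces the model's learned token prediction. The sketch is also sequential, whereas the article describes listening and speaking running in parallel.

```python
from enum import Enum
from typing import Callable, Iterable, List, Sequence, Tuple


class Decision(Enum):
    # Hypothetical labels; the model's actual token strings are not reproduced here.
    SILENT = "silent"
    SPEAK = "speak"


Chunk = Sequence[float]


def run_stream(chunks: Iterable[Chunk],
               decide: Callable[[List[Chunk]], Decision],
               respond: Callable[[List[Chunk]], str]) -> List[Tuple[int, str]]:
    """After every chunk, ask the policy whether to speak; collect (chunk_index, reply)."""
    history: List[Chunk] = []
    replies: List[Tuple[int, str]] = []
    for index, chunk in enumerate(chunks):
        history.append(chunk)
        if decide(history) is Decision.SPEAK:
            replies.append((index, respond(history)))
        # On SILENT nothing is emitted and the loop simply keeps listening.
    return replies


def energy_policy(history: List[Chunk]) -> Decision:
    """Toy stand-in for the model: speak when the newest chunk is loud."""
    latest = history[-1]
    mean_abs = sum(abs(x) for x in latest) / len(latest)
    return Decision.SPEAK if mean_abs > 0.5 else Decision.SILENT


def canned_reply(history: List[Chunk]) -> str:
    """Toy stand-in for response generation."""
    return f"heard something after {len(history)} chunks"


stream = [[0.0, 0.1], [0.9, 0.8], [0.0, 0.0]]
print(run_stream(stream, energy_policy, canned_reply))
# prints [(1, 'heard something after 2 chunks')]
```

Passing the full history to `decide` reflects the idea that instructions such as "Translate into English" arrive in the same stream and affect later decisions; a real system would carry that context in model state rather than a Python list.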

According to the paper, Audio-Interaction scored 58.15 points on the audio benchmark MMAU, narrowly beating its base model Qwen2.5-Omni-3B. It also comes close to much larger 7B models. On English-Chinese translation, the model improves substantially over its base model.

The Decoder: AI News (RSS)
