New open-source voice model listens nonstop and decides every 0.4 seconds whether to speak or stay silent
Key Points
The "Audio Interaction" AI model processes continuous audio streams and combines tasks such as dialog, translation, transcription and sound recognition in a single system.
To do this, it breaks down the audio stream into 0.4-second segments and decides after each segment via a special token whether it should remain silent or generate a response.
Trained on a synthetic dataset of 302,000 hours of audio, the model handles listening and speaking in parallel. This cuts the waiting time for responses and lets the system beat models such as Gemini 3 Flash in tests of proactive sound detection.
Researchers want to close the gap between today's audio speech models and real listeners. Their system handles dialog, translation, and sound recognition all at once.
Today's voice models, such as GPT-4o or Qwen 3.5-Omni, work like a dictation machine with a button: they only respond once the recording ends. Streaming systems such as Moshi for dialog or Paraformer for live subtitles do listen continuously, but each handles only one task and treats sounds like coughing as background noise.
Researchers from China, Hong Kong, and Singapore want to combine both approaches with "audio interaction." The model listens to an audio stream continuously, breaks it into 0.4-second chunks, and decides after each chunk whether to stay silent or speak. Translation, transcription, chatting, and reacting to everyday noises all run in a single three-billion-parameter model.
One special token every 0.4 seconds
After each audio snippet, the model outputs one of two special tokens: one that means "stay silent" and one that means "respond". If it picks the silence token, it keeps listening. Only with the response token does it start talking. Classic tasks like "Translate into English" become instructions within the same continuous stream.
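The per-chunk decision can be pictured as a simple loop. The following Python sketch is only an illustration, not the researchers' code: the 0.4-second chunk length comes from the article, while the 16 kHz sample rate, the placeholder token names SILENT and RESPOND, and the loudness-based toy policy standing in for the model are all assumptions made for the example.

```python
from typing import Callable, Iterator, List, Sequence

CHUNK_SECONDS = 0.4      # chunk length reported in the article
SAMPLE_RATE = 16_000     # assumption: the article does not state a sample rate
SILENT = "SILENT"        # hypothetical placeholder, not the model's real token
RESPOND = "RESPOND"      # hypothetical placeholder, not the model's real token


def chunk_stream(samples: Sequence[float],
                 sample_rate: int = SAMPLE_RATE) -> Iterator[Sequence[float]]:
    """Yield consecutive 0.4-second chunks; a shorter trailing chunk is kept."""
    size = int(round(sample_rate * CHUNK_SECONDS))
    for start in range(0, len(samples), size):
        yield samples[start:start + size]


def run_loop(samples: Sequence[float],
             decide: Callable[[List[Sequence[float]]], str],
             sample_rate: int = SAMPLE_RATE) -> List[float]:
    """Ask `decide` after every chunk; return the times (s) at which it chose RESPOND."""
    history: List[Sequence[float]] = []
    speak_times: List[float] = []
    for index, chunk in enumerate(chunk_stream(samples, sample_rate)):
        history.append(chunk)  # the decision sees everything heard so far
        if decide(history) == RESPOND:
            speak_times.append(round((index + 1) * CHUNK_SECONDS, 1))
    return speak_times


def energy_policy(history: List[Sequence[float]]) -> str:
    """Toy stand-in for the model: respond only when the latest chunk is loud."""
    latest = history[-1]
    mean_abs = sum(abs(x) for x in latest) / len(latest)
    return RESPOND if mean_abs > 0.5 else SILENT


if __name__ == "__main__":
    # Two quiet chunks, one loud chunk, one quiet chunk (1.6 seconds in total).
    chunk = int(SAMPLE_RATE * CHUNK_SECONDS)
    audio = [0.0] * (2 * chunk) + [1.0] * chunk + [0.0] * chunk
    print(run_loop(audio, energy_policy))
```

In the real system the decision would come from the language model itself rather than from a loudness threshold; the sketch only shows how a fixed 0.4-second cadence turns "when to speak" into one binary choice per chunk.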
According to the paper, Audio-Interaction scored 58.15 points on the audio benchmark MMAU, narrowly beating its base model Qwen2.5-Omni-3B. It also comes close to much larger 7B models. On English-Chinese translation, the model improves substantially over the base model.