# Audio Interaction: Open-source voice model listens continuously and decides every 0.4 seconds whether to speak or stay silent

- Source: The Decoder: AI News (RSS)
- Author: Jonathan Kemper
- Published: 2026-06-06 18:50
- AIHOT score: 68
- AIHOT link: https://aihot.virxact.com/items/cmq28qhlu0046slopub7yad3n
- Original link: https://the-decoder.com/new-open-source-voice-model-listens-nonstop-and-decides-every-0-4-seconds-whether-to-speak-or-stay-silent

## AI Summary

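Audio Interaction is an open-source voice model that listens to its surroundings continuously and decides every 0.4 seconds whether to speak or stay silent. Unlike GPT-4o or Qwen3.5-Omni, it does not wait for a recording to end, and it handles translation, transcription, and dialog while recognizing everyday noises such as coughing, all in a single stream. The code and model weights are available on GitHub under the Apache 2.0 open-source license; the training data is to be released later.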

## Full Text

New open-source voice model listens nonstop and decides every 0.4 seconds whether to speak or stay silent

### Key Points

- The "Audio Interaction" AI model processes continuous audio streams and combines tasks such as dialog, translation, transcription, and sound recognition in a single system.
- To do this, it breaks the audio stream into 0.4-second segments and, after each segment, uses a special token to decide whether to remain silent or generate a response.
- Trained on a synthetic dataset of 302,000 hours of audio, the model listens and speaks in parallel. This shortens the wait for responses and lets the system beat models such as Gemini 3 Flash in proactive noise detection tests.

Researchers want to close the gap between today's speech models and real listeners. Their system handles dialog, translation, and sound recognition in a single model.

Today's voice models, like GPT-4o or Qwen3.5-Omni, work like a dictation machine with a button: they respond only once the recording ends. Streaming systems like Moshi for dialog or Paraformer for live subtitles do listen continuously, but each handles only one task and treats sounds like coughing as background noise.

Researchers from China, Hong Kong, and Singapore want to combine both approaches with "Audio Interaction." The model listens to an audio stream continuously, breaks it into 0.4-second chunks, and decides after each chunk whether to stay silent or speak. Translation, transcription, chatting, and reacting to everyday noises all run in a single three-billion-parameter model.

### One special token every 0.4 seconds

After each audio snippet, the model outputs one of two special tokens: a silence token or a response token. If it picks the silence token, it keeps listening. Only with the response token does it start talking. Classic tasks like "Translate into English" become instructions within the same continuous stream.
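The per-chunk decision loop can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the 16 kHz sample rate, the `SILENT` and `RESPOND` labels, and the functions `chunk_stream`, `run`, and `energy_decider` are all hypothetical, and a simple loudness threshold stands in for the model's token prediction.

```python
from typing import Callable, Iterable, Iterator, List

SAMPLE_RATE = 16_000   # assumed sample rate; the article does not state one
CHUNK_SECONDS = 0.4    # chunk length reported in the article
CHUNK_SAMPLES = round(SAMPLE_RATE * CHUNK_SECONDS)  # samples per chunk

SILENT = "SILENT"      # placeholder label, not the model's real token
RESPOND = "RESPOND"    # placeholder label, not the model's real token


def chunk_stream(samples: Iterable[float], size: int = CHUNK_SAMPLES) -> Iterator[List[float]]:
    """Group incoming samples into fixed-size chunks; a trailing partial chunk is dropped."""
    buf: List[float] = []
    for s in samples:
        buf.append(s)
        if len(buf) == size:
            yield buf
            buf = []


def run(samples: Iterable[float],
        decide: Callable[[List[List[float]]], str],
        respond: Callable[[List[List[float]]], object]) -> list:
    """After every chunk, ask `decide` whether to stay silent or respond."""
    history: List[List[float]] = []
    outputs = []
    for chunk in chunk_stream(samples):
        history.append(chunk)
        if decide(history) == RESPOND:
            outputs.append(respond(history))
        # On SILENT nothing is emitted and the loop keeps listening.
    return outputs


def energy_decider(history: List[List[float]]) -> str:
    """Toy stand-in for the model: respond when the latest chunk is loud."""
    latest = history[-1]
    mean_abs = sum(abs(x) for x in latest) / len(latest)
    return RESPOND if mean_abs > 0.5 else SILENT
```

In the real system the decision would come from the model's next-token prediction over the whole stream so far, which is why `decide` receives the full history rather than only the latest chunk.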

According to the paper, Audio Interaction scores 58.15 points on the audio benchmark MMAU, narrowly beating its base model, Qwen2.5-Omni-3B, and coming close to much larger 7B models. On English-Chinese translation, it improves substantially over the base model.

For the model to learn when to step in, the team needed the right training data. Existing audio datasets consist of short, isolated clips and lack long sequences with sparse response signals, the researchers say.

So they built their own scenes in three stages. First, a language model designed a plausible setting (say, a kitchen in the morning) with 3 to 15 sub-events. The system then searched a database for matching clips or generated missing sounds, such as breaking glass, with audio models like AudioX or ElevenLabs. A preprocessing step then smoothed out the cut edges so the recordings sounded natural.
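The article does not say how the cut edges were smoothed. One common way to hide seams between concatenated clips is a linear crossfade, sketched below purely as an assumption about what such a preprocessing step might look like; `crossfade_concat` and its `overlap` parameter are hypothetical names.

```python
from typing import List


def crossfade_concat(clips: List[List[float]], overlap: int) -> List[float]:
    """Join clips end to end, blending up to `overlap` samples at each seam.

    Assumed technique (linear crossfade); not confirmed by the source.
    """
    if not clips:
        return []
    out = list(clips[0])
    for clip in clips[1:]:
        # Never blend more samples than either side actually has.
        n = min(overlap, len(out), len(clip))
        for i in range(n):
            w = (i + 1) / (n + 1)          # weight of the incoming clip rises across the seam
            idx = len(out) - n + i         # position in the tail of the audio so far
            out[idx] = (1 - w) * out[idx] + w * clip[i]
        out.extend(clip[n:])               # append the part of the clip that was not blended
    return out
```

With `overlap=0` the function falls back to plain concatenation, which makes the effect of the blend easy to compare by ear.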

The resulting StreamAudio-2M dataset contains 2.6 million units and about 302,000 hours of audio across seven skill areas and 28 subtasks.
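A quick back-of-the-envelope calculation shows why these scenes count as long sequences. It uses only the totals reported above and assumes, for illustration, that the hours are spread evenly across units; the actual length distribution is not given.

```python
TOTAL_HOURS = 302_000      # reported total audio
TOTAL_UNITS = 2_600_000    # reported number of units
CHUNK_SECONDS = 0.4        # reported chunk length

# Average length of one unit, assuming an even spread (an assumption).
avg_seconds = TOTAL_HOURS * 3600 / TOTAL_UNITS
avg_minutes = avg_seconds / 60

# Number of speak-or-stay-silent decisions the model makes over such a unit.
decisions_per_unit = avg_seconds / CHUNK_SECONDS

print(f"{avg_minutes:.1f} min per unit, about {decisions_per_unit:.0f} decisions")
```

Under that assumption, an average unit runs about seven minutes and spans roughly a thousand decision points, most of which should resolve to silence.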

### Two recurring streaming problems

Two weaknesses kept showing up during training. First, the model forgot earlier content in long, noisy sequences. The fix: asking questions that point back to passages from much earlier in the audio, forcing the model to build up long-term memory.

Second, the model fired too often on sounds that didn't matter. The team countered this with large amounts of verified silence and background audio that is explicitly not supposed to trigger a response. On the newly introduced ProactiveSound Bench, which contains 644 human-curated events, the model beats Gemini 3 Flash, Kimi-Audio-Instruct, and Step-Audio 2, among others.

### A queue instead of a blocking pipeline

For real-time use, the researchers split incoming audio processing from response generation. Both run in parallel and swap data through a queue: the audio side keeps writing new chunks, and the response side only reads them when it has nothing to say. Without this split, time-to-first-response jumped from 392 to 831 milliseconds, and the system got stuck 5.2 percent of the time.
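The split between audio intake and response generation resembles a standard producer-consumer pattern. The sketch below uses Python's `queue.Queue` and `threading` to illustrate the idea under simplifying assumptions: chunks are plain strings, and `audio_writer`, `responder`, `run_pipeline`, `decide`, and `speak` are hypothetical names rather than parts of the released code.

```python
import queue
import threading
from typing import Callable, Iterable, List


def audio_writer(chunks: Iterable[str], q: queue.Queue) -> None:
    """Producer: keeps writing chunks and never waits for the responder."""
    for chunk in chunks:
        q.put(chunk)
    q.put(None)  # sentinel marking the end of the stream


def responder(q: queue.Queue,
              decide: Callable[[str], bool],
              speak: Callable[[str], str],
              log: List[str]) -> None:
    """Consumer: reads the next chunk only when it is not busy speaking."""
    while True:
        chunk = q.get()
        if chunk is None:
            break
        if decide(chunk):
            # While this call runs, the writer keeps filling the queue.
            log.append(speak(chunk))


def run_pipeline(chunks: Iterable[str],
                 decide: Callable[[str], bool],
                 speak: Callable[[str], str]) -> List[str]:
    q: queue.Queue = queue.Queue()
    log: List[str] = []
    writer = threading.Thread(target=audio_writer, args=(chunks, q))
    reader = threading.Thread(target=responder, args=(q, decide, speak, log))
    writer.start()
    reader.start()
    writer.join()
    reader.join()
    return log
```

For example, `run_pipeline(["hum", "glass breaking", "hum"], lambda c: "glass" in c, lambda c: f"heard {c}")` returns `["heard glass breaking"]`. Because the queue is unbounded, a slow response never blocks audio intake, which is the property the split is meant to provide.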

The 0.4-second chunk size is a tradeoff. At 0.2 seconds, there isn't enough context and the model falls apart in dialog. At 0.8 seconds, latency climbs to 786 milliseconds.

Code and instructions for downloading the weights are on GitHub under the Apache 2.0 license, with no restrictions on commercial use. The full training dataset is set to follow later.
