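Inworld AI has released Realtime TTS-2, its next-generation real-time conversational speech model. The model's core advance is that it processes the full audio context of a multi-turn conversation before it speaks, so it can adapt to the conversational situation in real time the way a person would. Key features: a single voice that supports more than 100 languages, time-to-first-audio latency under 200 ms, and voice style that is adjusted through natural-language instructions, with no preset emotion tags required. This marks the first time voice AI has been able to "listen" to the overall mood of a conversation rather than only its literal content, and the architecture is designed to deliver a conversational experience that is both natural-sounding and context-aware.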
Inworld AI released Realtime TTS-2, a text-to-speech model that processes the full audio context of multi-turn exchanges before it speaks, adapting to the moment the way a person would.
One voice identity across 100+ languages.
Sub-200ms time-to-first-audio.
Natural-language voice direction, with no preset emotion tags.
AI that hears how you sound, not only what you say, is now a real architecture decision.
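
For anyone who wants to check a time-to-first-audio figure like the sub-200ms one against their own setup, here is a minimal Python sketch of how that number is usually measured on a streaming response. It is not Inworld's API: `time_to_first_audio` and `fake_tts_stream` are hypothetical names, and the fake stream only stands in for whatever chunk iterator a real client would return.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def time_to_first_audio(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, int]:
    """Return (seconds until the first non-empty chunk, total bytes received)."""
    start = clock()
    first_audio_at = None
    total_bytes = 0
    for chunk in chunks:
        # Only a non-empty chunk counts as "first audio".
        if chunk and first_audio_at is None:
            first_audio_at = clock()
        total_bytes += len(chunk)
    if first_audio_at is None:
        raise ValueError("stream ended without producing any audio")
    return first_audio_at - start, total_bytes


def fake_tts_stream(first_chunk_delay_s: float, n_chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response; not a real client."""
    time.sleep(first_chunk_delay_s)
    for _ in range(n_chunks):
        yield b"\x00" * 320  # 10 ms of 16 kHz, 16-bit mono silence


if __name__ == "__main__":
    ttfa, total = time_to_first_audio(fake_tts_stream(0.05))
    print(f"time to first audio: {ttfa * 1000:.0f} ms, {total} bytes received")
```

Because the generator is lazy, the clock starts before the simulated delay, so the measurement covers the wait for the first chunk. In a real test the request would be sent inside the iterator as well, so that network and model latency are both included.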