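The Artificial Analysis team has launched the AA-WER Streaming benchmark, which evaluates streaming speech-to-text models for voice agent use cases, focusing on accuracy and latency. Streaming models have to balance the two. In the results, Cartesia Ink-2 leads on final-transcript accuracy with a word error rate (WER) of 3.59% at 210 ms latency; ElevenLabs Scribe v2 Realtime follows closely with 3.64% WER and 140 ms latency; Deepgram Flux has the lowest latency (about 20 ms) but a WER of 7.36%. These three models sit on the accuracy-latency Pareto frontier.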
Overview of our recently launched AA-WER Streaming benchmark, measuring streaming Speech to Text models on accuracy and latency for voice agent use cases
Streaming Speech to Text (STT) powers real-time transcription in voice agents and live captioning, where models must balance accuracy against speed. Fast transcripts keep responses feeling natural and free up the response-time budget for reasoning and tool calls. Accuracy matters too, since errors can compound downstream.
Streaming STT models transcribe audio as it arrives and emit output continuously, unlike offline (batch) models, which process the entire file at once and are typically slower to return a transcript.
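To make the accuracy-versus-latency trade-off concrete, here is a minimal sketch of how a Pareto frontier over (WER, latency) pairs could be computed. It uses the three headline results from the summary, with Deepgram Flux's latency taken as the approximate 20 ms figure. The `Result` type, the `pareto_frontier` helper, and the "Hypothetical Model X" row are illustrative assumptions, not part of the benchmark or its methodology.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    wer_pct: float     # word error rate in percent (lower is better)
    latency_ms: float  # latency in milliseconds (lower is better)


def dominates(a: Result, b: Result) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.wer_pct <= b.wer_pct and a.latency_ms <= b.latency_ms
    strictly_better = a.wer_pct < b.wer_pct or a.latency_ms < b.latency_ms
    return no_worse and strictly_better


def pareto_frontier(results: list[Result]) -> list[Result]:
    """Keep every result that no other result dominates."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


results = [
    Result("Cartesia Ink-2", 3.59, 210),
    Result("ElevenLabs Scribe v2 Realtime", 3.64, 140),
    Result("Deepgram Flux", 7.36, 20),  # latency is approximate
    # Invented row, included only to show a dominated point being dropped.
    Result("Hypothetical Model X", 8.00, 250),
]

frontier = pareto_frontier(results)
for r in frontier:
    print(f"{r.model}: {r.wer_pct}% WER, {r.latency_ms} ms")
```

Under these assumptions, the three reported models all remain on the frontier because each one beats the others on at least one axis, while the invented row is dropped because Cartesia Ink-2 is better on both.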