Artificial Analysis@ArtificialAnlys

2026-06-02 00:55 · 31 days ago

AI Summary

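The Artificial Analysis team has launched the AA-WER Streaming benchmark, which evaluates streaming speech-to-text models in voice agent scenarios on accuracy and latency. Streaming models have to balance the two. In the results, Cartesia Ink-2 leads on final transcript accuracy with a 3.59% word error rate (WER) at 210ms latency; ElevenLabs Scribe v2 Realtime follows closely at 3.64% WER and 140ms latency; Deepgram Flux has the lowest latency (about 20ms) but a WER of 7.36%. All three models sit on the accuracy-latency Pareto frontier.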

Overview of our recently launched AA-WER Streaming benchmark, measuring streaming Speech to Text models on accuracy and latency for voice agent use cases

Streaming Speech to Text (STT) powers real-time transcription in voice agents and live captioning, where models must balance accuracy against speed. Fast transcripts keep responses feeling natural and free up the response-time budget for reasoning and tool calls. Accuracy matters too, since errors can compound downstream.

Streaming STT models transcribe audio as it is fed in, sharing outputs continuously, unlike offline (batch) models that process the entire file at once and are typically slower.
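To make the streaming vs. batch distinction concrete, here is a toy Python sketch. It is only an illustration of the two interfaces: `fake_recognize`, `transcribe_batch`, and `transcribe_streaming` are hypothetical names, the "audio chunks" are already words, and no real STT model or vendor API is involved.

```python
from typing import Iterable, Iterator, List


def fake_recognize(chunk: str) -> str:
    # Hypothetical stand-in for a recognizer: each "audio chunk" is already a word.
    return chunk


def transcribe_batch(chunks: List[str]) -> str:
    """Offline style: consume the whole input, then return one final transcript."""
    return " ".join(fake_recognize(c) for c in chunks)


def transcribe_streaming(chunks: Iterable[str]) -> Iterator[str]:
    """Streaming style: yield a growing partial transcript after every chunk."""
    words: List[str] = []
    for chunk in chunks:
        words.append(fake_recognize(chunk))
        yield " ".join(words)


audio = ["book", "a", "table", "for", "two"]
partials = list(transcribe_streaming(audio))  # one partial per chunk
final = transcribe_batch(audio)               # a single result at the end
```

In this sketch a downstream voice agent could start acting on `partials` before the audio ends, while the batch path only has something to work with once `final` is returned.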

Models from Cartesia, ElevenLabs, and Deepgram sit on the accuracy-latency Pareto frontier. Cartesia Ink-2 leads on final transcript accuracy at 3.59% WER (210ms), closely followed by ElevenLabs Scribe v2 Realtime at 3.64% WER (140ms). Deepgram Flux is fastest at ~20ms on final transcript latency (7.36% WER).
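As a rough illustration of what "on the Pareto frontier" means here, the sketch below keeps every model that no other model beats on both WER and latency. The three data points are the figures quoted above (Deepgram's ~20ms is treated as exactly 20 for simplicity); `hypothetical-model` is an invented entry added only to show a dominated point, and the `Result` and `pareto_frontier` names are assumptions, not part of the benchmark.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    model: str
    wer_pct: float      # lower is better
    latency_ms: float   # lower is better


def pareto_frontier(results: List[Result]) -> List[Result]:
    """Return the results that are not dominated on (WER, latency)."""
    def dominated(r: Result) -> bool:
        # r is dominated if some other result is at least as good on both
        # axes and strictly better on at least one.
        return any(
            o.wer_pct <= r.wer_pct
            and o.latency_ms <= r.latency_ms
            and (o.wer_pct < r.wer_pct or o.latency_ms < r.latency_ms)
            for o in results
        )
    return [r for r in results if not dominated(r)]


results = [
    Result("Cartesia Ink-2", 3.59, 210),
    Result("ElevenLabs Scribe v2 Realtime", 3.64, 140),
    Result("Deepgram Flux", 7.36, 20),            # latency quoted as ~20ms
    Result("hypothetical-model", 8.00, 250),      # invented, for illustration only
]
frontier = pareto_frontier(results)
frontier_names = [r.model for r in frontier]
```

Each of the three quoted models survives because improving on its WER means accepting higher latency, and vice versa; the invented fourth entry is worse on both axes and drops out.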

In this video, Kiriill Butler, Member of Technical Staff at Artificial Analysis, walks through the benchmark and key results.

