Artificial Analysis@ArtificialAnlys

2026-05-28 23:34·35天前

AI 摘要

AA-WER Streaming是一个新基准，用于测量流式语音转文本模型在语音智能体场景下的准确率与延迟。该测试基于约8小时音频，报告词错误率与延迟。关键结果显示：Cartesia Ink-2（语义端点）在最终转录中准确率最高（WER 3.59%，延迟0.21秒）；ElevenLabs Scribe v2 Realtime在首次部分转录中准确率最高（WER 3.65%，延迟0.13秒）；Deepgram Flux在速度上领先，最终和首次部分转录延迟分别为0.020秒和0.019秒。

Announcing AA-WER Streaming， our new benchmark measuring streaming Speech to Text models on accuracy and latency for voice agent use cases. Pareto optimal models on this new benchmark include those from Cartesia， ElevenLabs， and Deepgram

Streaming Speech to Text （STT） powers real-time transcription in voice agents and live captioning， where models must balance accuracy against speed. Fast transcripts are especially important for keeping responses feeling natural and leaves more of the response-time budget for reasoning and tool calls. Accuracy also matters since transcription errors compound in downstream reasoning and speech generation.

Streaming STT models transcribe audio as it is fed in， sharing outputs continuously， unlike offline （batch） models that process the entire file at once and are typically slower.

What we measure： AA-WER Streaming reports Word Error Rate and latency together， measured from the moment end of speech is detected， with a Pareto line of increasing accuracy as time to transcript received increases. For direct comparability to offline models on accuracy， we test these streaming models on the same ~8 hours of audio as our offline benchmark， AA-WER v2.0： AA-AgentTalk， Earnings22-Cleaned-AA， VoxPopuli-Cleaned-AA.

We measure WER and latency as paired metrics at two points after Silero VAD-detected end of speech： First Final Transcription： WER is measured on the first final-denoted transcript returned after end of speech is detected. Latency is the time in seconds from end of speech to that final-denoted transcript. This is more useful for understanding performance as a standalone streaming transcription model， and for higher accuracy. First Partial Transcription： WER is measured on the first transcript-bearing event （partial or final） returned after end of speech is detected. Latency is the time in seconds from end of speech to that first transcript event. This is more useful for near instantaneous transcription for lower-accuracy tasks like responding to "yes" or "no" questions， or for speculative decoding.

Artificial Analysis@ArtificialAnlys · X

70导出 Markdown

2026-05-28 23:34·35天前

在 X 看原推· x.com

AI 摘要