# Announcing AA-WER Streaming: A New Benchmark Measuring Streaming Speech-to-Text Models for Voice Agent Use Cases

- Source: Artificial Analysis (@ArtificialAnlys)
- Published: 2026-05-28 23:34
- AIHOT score: 70
- AIHOT link: https://aihot.virxact.com/items/cmppo38xu01vislvysjobnnvm
- Original link: https://x.com/ArtificialAnlys/status/2060021901234458958

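## AI Summary

AA-WER Streaming is a new benchmark that measures the accuracy and latency of streaming speech-to-text models in voice agent scenarios. The test uses about 8 hours of audio and reports word error rate (WER) and latency. Key results: Cartesia Ink-2 (semantic endpoints) is the most accurate on the final transcription (3.59% WER, 0.21s latency); ElevenLabs Scribe v2 Realtime is the most accurate on the first partial transcription (3.65% WER, 0.13s latency); Deepgram Flux is the fastest, with latencies of 0.020s and 0.019s on the final and first partial transcriptions respectively.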

## Full Text

Announcing AA-WER Streaming, our new benchmark measuring streaming Speech to Text models on accuracy and latency for voice agent use cases. Pareto optimal models on this new benchmark include those from Cartesia, ElevenLabs, and Deepgram.

Streaming Speech to Text (STT) powers real-time transcription in voice agents and live captioning, where models must balance accuracy against speed. Fast transcripts are especially important: they keep responses feeling natural and leave more of the response-time budget for reasoning and tool calls. Accuracy also matters, since transcription errors compound in downstream reasoning and speech generation.

Streaming STT models transcribe audio as it is fed in, sharing outputs continuously, unlike offline (batch) models that process the entire file at once and are typically slower.

What we measure:
AA-WER Streaming reports Word Error Rate (WER) and latency together, with latency measured from the moment end of speech is detected. The results trace a Pareto line along which accuracy increases as the time until the transcript is received increases. For direct comparability to offline models on accuracy, we test these streaming models on the same ~8 hours of audio as our offline benchmark, AA-WER v2.0: AA-AgentTalk, Earnings22-Cleaned-AA, VoxPopuli-Cleaned-AA.
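For readers unfamiliar with the metric, here is a minimal sketch of the standard WER definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the lowercase-and-split normalization are assumptions made for illustration; the post does not describe the text normalization AA-WER actually applies.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    # Assumption: naive normalization (lowercase + whitespace split).
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before this row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (b -> x) and one deletion (d) over four reference words.
    print(word_error_rate("a b c d", "a x c"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.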

We measure WER and latency as paired metrics at two points after Silero VAD-detected end of speech:
- First Final Transcription: WER is measured on the first final-denoted transcript returned after end of speech is detected. Latency is the time in seconds from end of speech to that final-denoted transcript. This is more useful for understanding performance as a standalone streaming transcription model, and when higher accuracy is needed.
- First Partial Transcription: WER is measured on the first transcript-bearing event (partial or final) returned after end of speech is detected. Latency is the time in seconds from end of speech to that first transcript event. This is more useful for near-instantaneous transcription in lower-accuracy tasks like responding to "yes" or "no" questions, or for speculative decoding.
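The two measurement points can be sketched as a selection over a timestamped event stream. Everything named here is hypothetical: the `TranscriptEvent` shape, the `first_after_end_of_speech` helper, and the choice to skip empty-text events and events that arrive before end of speech are assumptions for illustration, not the benchmark's harness. The end-of-speech timestamp is taken as given (the post says it comes from Silero VAD).

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class TranscriptEvent:
    """Hypothetical shape of one streaming STT event."""
    received_at: float  # seconds on the same clock as end_of_speech
    text: str
    is_final: bool


def first_after_end_of_speech(
    events: Sequence[TranscriptEvent],
    end_of_speech: float,
    final_only: bool,
) -> Optional[Tuple[str, float]]:
    """Return (transcript, latency_s) for the first qualifying event.

    final_only=True  -> "First Final": first final-denoted transcript.
    final_only=False -> "First Partial": first transcript-bearing event,
                        whether partial or final.
    """
    candidates = [
        e for e in events
        if e.received_at >= end_of_speech      # assumption: earlier events ignored
        and e.text.strip()                     # assumption: must carry text
        and (e.is_final or not final_only)
    ]
    if not candidates:
        return None
    first = min(candidates, key=lambda e: e.received_at)
    return first.text, first.received_at - end_of_speech


if __name__ == "__main__":
    stream = [
        TranscriptEvent(1.00, "yes", False),
        TranscriptEvent(2.05, "yes please", False),
        TranscriptEvent(2.20, "Yes, please.", True),
    ]
    print(first_after_end_of_speech(stream, end_of_speech=2.0, final_only=False))
    print(first_after_end_of_speech(stream, end_of_speech=2.0, final_only=True))
```

Each returned transcript would then be scored against the reference to obtain the WER paired with that latency.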

Key results：
➤ Highest accuracy on Final after End of Speech: @Cartesia Ink-2 (semantic endpoints) at 3.59% WER, 0.21s latency, followed by ElevenLabs Scribe v2 Realtime (3.64%, 0.14s) and Cartesia Ink-2 (external endpoints) (3.66%, 0.09s)
➤ Highest accuracy on First Partial after End of Speech: @ElevenLabs Scribe v2 Realtime at 3.65% WER, 0.13s latency, followed by Cartesia Ink-2 (external endpoints) (4.33%, 0.07s) and @AssemblyAI U3 Realtime Pro (4.46%, 0.47s)
➤ Fastest transcription: @DeepgramAI Flux leads both Final and Partial at 0.020s and 0.019s respectively (both 7.36% WER). On Final, it's followed by @soniox_ai Realtime and Deepgram Nova-3 Realtime (both 0.06s); on First Partial, it's followed by @NVIDIA Nemotron 3 ASR 80ms (0.04s) and Soniox Realtime (0.05s)

Charts below include a Pareto frontier of accuracy vs. speed, so you can shortlist the models that best fit your latency constraints while still achieving high accuracy. See below for further detail ⬇️
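To make the Pareto frontier concrete, the sketch below computes one over the First Partial figures quoted in the key results above. It is an illustration only: the `pareto_frontier` helper is hypothetical, and the input covers just the four models whose WER and latency are both stated in this post, so the frontier in the actual charts (which include more models) may differ.

```python
from typing import List, Tuple

Result = Tuple[str, float, float]  # (model, WER in percent, latency in seconds)

# First Partial figures as quoted in the post. Models whose WER is not
# stated in the post are omitted.
FIRST_PARTIAL: List[Result] = [
    ("ElevenLabs Scribe v2 Realtime", 3.65, 0.13),
    ("Cartesia Ink-2 (external endpoints)", 4.33, 0.07),
    ("AssemblyAI U3 Realtime Pro", 4.46, 0.47),
    ("Deepgram Flux", 7.36, 0.019),
]


def pareto_frontier(results: List[Result]) -> List[Result]:
    """Keep models that no other model beats on both WER and latency."""
    # Walk from fastest to slowest; a model stays only if it is more
    # accurate than every faster model seen so far.
    ordered = sorted(results, key=lambda r: (r[2], r[1]))
    frontier: List[Result] = []
    best_wer = float("inf")
    for name, wer, latency in ordered:
        if wer < best_wer:
            frontier.append((name, wer, latency))
            best_wer = wer
    return frontier


if __name__ == "__main__":
    for name, wer, latency in pareto_frontier(FIRST_PARTIAL):
        print(f"{name}: {wer:.2f}% WER at {latency:.3f}s")
```

Among these four, AssemblyAI U3 Realtime Pro drops off the frontier because ElevenLabs Scribe v2 Realtime is both more accurate (3.65% vs. 4.46%) and faster (0.13s vs. 0.47s); the other three each trade some accuracy for speed.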
