Open ASR Leaderboard adds multilingual and long-form tracks, revealing new challenges for model performance
Source: huggingface.co. The ASR leaderboard adds multilingual and long-form evaluation to help developers optimize speech applications.
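Hugging Face's Open ASR Leaderboard has added multilingual and long-form speech recognition evaluation tracks. The multilingual track covers 8 languages, while the long-form track tests a model's ability to handle several minutes of continuous speech. The new leaderboard shows that leading models' word error rates on multilingual tasks are on average about 15% higher than those of dedicated monolingual models, and that error rates on long-form tasks can rise by more than 20%, highlighting that model generalization in real-world applications still faces serious challenges.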
Open ASR Leaderboard: Trends and Insights with New Multilingual & Long-Form Tracks
Most benchmarks focus on short-form English transcription (<30s) and overlook other important tasks, such as (1) multilingual performance and (2) model throughput, which can be a deciding factor for long-form audio like meetings and podcasts.
Over the past two years, the Open ASR Leaderboard has become a standard for comparing open and closed-source models on both accuracy and efficiency. Recently, multilingual and long-form transcription tracks have been added to the leaderboard 🎉
TL;DR - Open ASR Leaderboard
- 📝 New preprint on ASR trends from the leaderboard: https://hf.co/papers/2510.06961
- 🧠 Best accuracy: Conformer encoder + LLM decoders (open-source ftw 🥳)
- ⚡ Fastest: CTC / TDT decoders
- 🌍 Multilingual: Comes at the cost of single-language performance
- ⌛ Long-form: Closed-source systems still lead (for now 😉)
- 🧑‍💻 Fine-tuning guides (Parakeet, Voxtral, Whisper): to continue pushing performance
Takeaways from 60+ models
As of 21 Nov 2025, the Open ASR Leaderboard compares 60+ open and closed-source models from 18 organizations, across 11 datasets.
In a recent preprint, we dive into the technical setup and highlight some key trends in modern ASR. Here are the big takeaways 👇
1. Conformer encoder 🤝 LLM decoder tops the charts 📈
Models combining Conformer encoders with large language model (LLM) decoders currently lead in English transcription accuracy. For example, NVIDIA’s Canary-Qwen-2.5B, IBM’s Granite-Speech-3.3-8B, and Microsoft’s Phi-4-Multimodal-Instruct achieve the lowest word error rates (WER), showing that integrating LLM reasoning can significantly boost ASR accuracy.
💡 Pro-tip: NVIDIA introduced Fast Conformer, a 2x faster variant of the Conformer, that is used in their Canary and Parakeet suite of models.
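For readers new to the metric, WER is commonly computed as the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The following is a minimal sketch, assuming plain whitespace tokenization and no text normalization; real evaluations typically normalize casing and punctuation first, which is not reproduced here. The function name `word_error_rate` is illustrative, not part of any leaderboard code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    # One substitution (sat -> sit) and one deletion (the) over 6 words.
    print(f"WER: {word_error_rate(ref, hyp):.3f}")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.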
2. Speed–accuracy tradeoffs ⚖️
While highly accurate, these LLM decoders tend to be slower than simpler approaches. On the Open ASR Leaderboard, efficiency is measured using inverse real-time factor (RTFx), where higher is better.
For even faster inference, CTC and TDT decoders deliver 10–100× faster throughput, albeit with slightly higher error rates. This makes them ideal for real-time, offline, or batch transcription tasks (such as meetings, lectures, or podcasts).
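As a rough illustration of the metric, RTFx is the duration of the audio divided by the time taken to transcribe it, so an RTFx of 100 means 100 seconds of audio are processed per second of compute. The sketch below assumes a hypothetical zero-argument `transcribe` callable standing in for any model call; it is not the leaderboard's benchmarking code, which may handle batching and warm-up differently.

```python
import time
from typing import Callable


def inverse_real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """RTFx = audio duration / processing time (higher is better)."""
    if audio_seconds <= 0 or processing_seconds <= 0:
        raise ValueError("durations must be positive")
    return audio_seconds / processing_seconds


def measure_rtfx(transcribe: Callable[[], object], audio_seconds: float) -> float:
    """Time one call to `transcribe` (hypothetical) and return its RTFx."""
    start = time.perf_counter()
    transcribe()
    elapsed = time.perf_counter() - start
    return inverse_real_time_factor(audio_seconds, elapsed)


if __name__ == "__main__":
    # 10 minutes of audio transcribed in 6 seconds.
    print(inverse_real_time_factor(600.0, 6.0))
```

A single timed call is noisy; averaging over a full dataset gives a more stable figure.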
3. Multilingual 🌍
OpenAI’s Whisper Large v3 remains a strong multilingual baseline, supporting 99 languages. However, fine-tuned or distilled variants like Distil-Whisper and CrisperWhisper often outperform the original on English-only tasks, showing how targeted fine-tuning can improve specialization (how to fine-tune? Check out guides for Whisper, Parakeet, and Voxtral).
That said, focusing on English tends to reduce multilingual coverage 👉 a classic case of the tradeoff between specialization and generalization. Similarly, while self-supervised systems like Meta’s Massively Multilingual Speech (MMS) and Omnilingual ASR can support 1K+ languages, they trail behind language-specific encoders in accuracy.
⭐ While just five languages are currently benchmarked, we’re planning to expand to more languages and are excited for new dataset and model contributions to multilingual ASR through GitHub pull requests.
🎯 Alongside multilingual benchmarks, several community-driven leaderboards focus on individual languages. For example, the Open Universal Arabic ASR Leaderboard compares models across Modern Standard Arabic and regional dialects, highlighting how speech variation and diglossia challenge current systems. Similarly, the Russian ASR Leaderboard provides a growing hub for evaluating encoder-decoder and CTC models on Russian-specific phonology and morphology. These localized efforts mirror the broader multilingual leaderboard’s mission to encourage dataset sharing, fine-tuned checkpoints, and transparent model comparisons, especially in languages with fewer established ASR resources.
4. Long-form transcription is a different game ⏳
For long-form audio (e.g., podcasts, lectures, meetings), closed-source systems still edge out open ones. This could be due to domain tuning, custom chunking, or production-grade optimization.
Among open models, OpenAI’s Whisper Large v3 performs the best. But for throughput, CTC-based Conformers shine 👉 for example, NVIDIA’s Parakeet CTC 1.1B achieves an RTFx of 2793.75, compared to 68.56 for Whisper Large v3, with only a moderate WER degradation (6.68 for Parakeet versus 6.43 for Whisper Large v3).
The tradeoff? Parakeet is English-only, again reminding us of that multilingual and specialization tradeoff 🫠.
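To put the figures quoted above in perspective, the short sketch below works through the arithmetic: the throughput ratio, the WER gap, and the time each model would need for one hour of audio. It assumes RTFx scales linearly to an hour-long file on the same hardware and treats the WER values as percentages; the helper name `seconds_to_transcribe` is illustrative.

```python
# Figures quoted in the text above.
PARAKEET_RTFX, WHISPER_RTFX = 2793.75, 68.56
PARAKEET_WER, WHISPER_WER = 6.68, 6.43


def seconds_to_transcribe(audio_seconds: float, rtfx: float) -> float:
    """Processing time implied by an RTFx value (assumes linear scaling)."""
    return audio_seconds / rtfx


speedup = PARAKEET_RTFX / WHISPER_RTFX        # throughput ratio
wer_gap = PARAKEET_WER - WHISPER_WER          # absolute WER difference
relative_wer_increase = wer_gap / WHISPER_WER # relative to Whisper's WER

if __name__ == "__main__":
    one_hour = 3600.0
    print(f"Throughput ratio: {speedup:.1f}x")
    print(f"Absolute WER gap: {wer_gap:.2f} points")
    print(f"Relative WER increase: {relative_wer_increase:.1%}")
    print(f"Parakeet, 1 h of audio: {seconds_to_transcribe(one_hour, PARAKEET_RTFX):.1f} s")
    print(f"Whisper,  1 h of audio: {seconds_to_transcribe(one_hour, WHISPER_RTFX):.1f} s")
```

Under these assumptions, Parakeet is roughly 40 times faster for a WER gap of about a quarter of a point.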
⭐ While closed systems still lead, there’s huge potential for open-source innovation here. Long-form ASR remains one of the most exciting frontiers for the community to tackle next!
🎤 The Show Must Go On
Given how fast ASR is evolving, we’re excited to see what new architectures push performance and efficiency, and how the Open ASR Leaderboard continues to serve as a transparent, community-driven benchmark for the field, and as a reference for other leaderboards (Russian, Arabic, and Speech DeepFake Detection).
We’ll keep expanding the Open ASR Leaderboard with more models, more languages, and more datasets, so stay tuned 👀
👉 Want to contribute? Head on over to the GitHub repo to open a pull request 🚀
Community
The tradeoff? Parakeet is English-only...
....
Given how fast ASR is evolving
So much so this sentence is already out of date.
The latest version of Parakeet (v3) now supports 25 languages.
Thanks for the comment! Yes, we should have specified Parakeet v2. The latest Parakeet v3 is in the main leaderboard (not yet in long-form @Steveeeeeeen) and has a similar RTFx, so there wouldn't be a trade-off on multilingual (but there is a dip in English performance).
First of all, thanks for the leaderboard, a very useful resource.
It would be a very nice addition to have an "efficiency" column that directly shows the AverageWER / RTFx ratio, or a 2D plot of it, showing the "Pareto frontier", as is often done for LLMs nowadays.
Technology changes humanity
This breakdown captures how rapidly ASR is evolving, especially the clear divide between LLM-enhanced accuracy and the blazing throughput of CTC/TDT systems. The post also highlights the ongoing tradeoff between multilingual coverage and single-language specialization, which is crucial for real-world deployment scenarios like long-form transcription workflows. With the Open ASR Leaderboard adding multilingual and long-audio tracks, it’s becoming one of the most reliable ways to gauge practical ASR performance today.
Why are there no models from China?
not intentional on our part and most models are contributed from the community! if you think an important model is missing, please do open a PR 🙂 https://github.com/huggingface/open_asr_leaderboard?tab=readme-ov-file#add-a-new-model
Great stuff, thank you! How do you view the longform results on the board?
The latest updates to the Open ASR Leaderboard are particularly interesting because they expand evaluation beyond traditional short-form English speech recognition tasks. The addition of multilingual and long-form tracks provides a more realistic benchmark for modern ASR systems, especially as real-world applications increasingly involve multiple languages, accents, and extended conversations.
One trend that stands out is how some models perform exceptionally well on short benchmarks but show noticeable differences when evaluated on longer audio segments. Long-form transcription introduces challenges such as speaker consistency, context retention, punctuation accuracy, and error accumulation over time. Similarly, multilingual evaluation highlights the importance of robust language coverage rather than optimization for a single language.
I'm curious to hear what others think about these new tracks and whether they better reflect practical ASR use cases. I've been following developments in speech recognition and AI benchmarking through my research and articles on spotifyipa, where I often explore how benchmark results translate into real-world performance.