Agentic ASR: Agent-Based Correction and Semantic Evaluation for Human-Like Interactive Speech Recognition
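Source: arxiv.org
Summary: To address the difficulty that single-pass speech recognition has in correcting meaning-critical errors, the researchers propose Agentic ASR, a closed-loop framework that integrates a single-pass ASR front-end with semantic correction, intent routing, and reasoning-based editing, and that formulates interactive speech recognition as a multi-turn refinement task. They also introduce the Sentence-level Semantic Error Rate (S²ER), an LLM-based semantic evaluation metric, and build an interactive simulation system for scalable, reproducible benchmarking. On multilingual, named-entity-intensive, and code-switching benchmarks, iterative interaction consistently reduces semantic errors, and S²ER improves far more than conventional token-level metrics do. Human-AI alignment and ablation studies validate the reliability of the semantic judge and the robustness of the framework. The code and an online demo are publicly available.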
Automatic speech recognition (ASR) is a core component of human-computer interaction and an increasingly important front-end for LLM-based assistants and agents. However, most current ASR systems still follow a single-pass paradigm, which is poorly aligned with human communication, where misunderstandings are resolved through iterative clarification and refinement. This mismatch makes it difficult to correct meaning-critical errors once they occur. Meanwhile, token-level metrics such as WER and CER do not adequately reflect such errors. To address these limitations, we formulate Interactive ASR as a multi-turn refinement task and propose Agentic ASR, a closed-loop framework that combines a single-pass ASR front-end with semantic correction, intent routing, and reasoning-based editing. We further introduce the Sentence-level Semantic Error Rate (S²ER), an LLM-based semantic evaluation metric, together with an Interactive Simulation System for scalable and reproducible benchmarking. Experiments on multilingual, named-entity-intensive, and code-switching benchmarks show that iterative interaction consistently reduces semantic errors, with much larger gains in S²ER than in conventional token-level metrics. Human-AI alignment and ablation studies further validate the reliability of the semantic judge and the robustness of the proposed framework. The code is available at https://interactiveasr.github.io/ and the live demo is available at https://i-asr.sjtuxlance.com/.
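To make the two central ideas concrete, the sketch below illustrates a multi-turn refinement loop and a sentence-level semantic error rate. It is a toy under stated assumptions, not the authors' implementation. The abstract does not give the formula for S²ER, so the sketch assumes it is the fraction of sentences that a judge marks as not preserving the reference meaning. All names (`s2er`, `refine`, `toy_judge`, `simulated_user`, `toy_edit`) are hypothetical; the LLM judge is replaced by a case-insensitive exact match, the reasoning-based editor by a word substitution, and intent routing is not modeled.

```python
from typing import Callable, Optional, Sequence


def s2er(hyps: Sequence[str], refs: Sequence[str],
         judge: Callable[[str, str], bool]) -> float:
    """Assumed definition: share of sentences the judge marks as semantically wrong."""
    if len(hyps) != len(refs) or not refs:
        raise ValueError("need equally sized, non-empty hypothesis and reference lists")
    errors = sum(0 if judge(h, r) else 1 for h, r in zip(hyps, refs))
    return errors / len(refs)


def refine(initial: str,
           feedback: Callable[[str], Optional[str]],
           edit: Callable[[str, str], str],
           max_turns: int = 3) -> str:
    """Closed loop: ask for feedback, apply an edit, stop when no feedback remains."""
    hyp = initial
    for _ in range(max_turns):
        fb = feedback(hyp)
        if fb is None:  # the (simulated) user accepts the transcript
            break
        hyp = edit(hyp, fb)
    return hyp


def toy_judge(hyp: str, ref: str) -> bool:
    # Stand-in for an LLM-based semantic judge (assumption): exact match ignoring case.
    return hyp.strip().lower() == ref.strip().lower()


def simulated_user(reference: str) -> Callable[[str], Optional[str]]:
    # Hypothetical simulated user: reports the first mismatched word as "wrong->right".
    def feedback(hyp: str) -> Optional[str]:
        for h, r in zip(hyp.split(), reference.split()):
            if h != r:
                return f"{h}->{r}"
        return None
    return feedback


def toy_edit(hyp: str, fb: str) -> str:
    # Stand-in for reasoning-based editing (assumption): substitute one word.
    wrong, right = fb.split("->")
    return hyp.replace(wrong, right, 1)


refs = ["book a flight to Austin", "call Dr. Nguyen"]
first_pass = ["book a flight to Boston", "call Dr. Nguyen"]
refined = [refine(h, simulated_user(r), toy_edit) for h, r in zip(first_pass, refs)]

print(s2er(first_pass, refs, toy_judge))  # error rate before interaction
print(s2er(refined, refs, toy_judge))     # error rate after interaction
```

In this toy, a single substituted word leaves a token-level metric almost unchanged while flipping the whole sentence from wrong to right, which is the kind of gap between token-level and sentence-level semantic scoring that the abstract points to.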