# Voice agent performance benchmark released: top models resolve only about half of realistic customer service scenarios

- Source: Artificial Analysis (@ArtificialAnlys)
- Published: 2026-05-13 00:19
- AIHOT score: 62
- AIHOT link: https://aihot.virxact.com/items/cmp2ujv7200qmsl1q5rfb0in7
- Original link: https://x.com/ArtificialAnlys/status/2054234919887573292

## AI Summary

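Artificial Analysis has launched a voice agent benchmark based on τ-Voice that evaluates tool calling and multi-turn conversation in customer service scenarios. The results show that today's strongest Speech to Speech models resolve only about half of realistic tasks end-to-end, a clear gap relative to text-based agents. The voice channel is harder because of accents, noise, and network problems, combined with the need for fast responses and conversational consistency. Under realistic audio conditions in simulated airline, retail, and telecom domains, xAI's Grok Voice Think Fast 1.0 leads with a 52.1% success rate and an average conversation length of 5.6 minutes; models from OpenAI and Google follow. The benchmark complements the existing Big Bench Audio intelligence benchmark and the conversational naturalness evaluation.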

## Body

Announcing agentic performance benchmarking for Speech to Speech models on Artificial Analysis. We use τ-Voice to measure the tool calling and customer interaction capabilities of voice agents in realistic customer service scenarios.

Even the strongest Speech to Speech (S2S) models today resolve only about half of realistic customer service scenarios end-to-end, a meaningful gap relative to frontier text-based agents on the same tasks. Voice channels introduce significant complexity: challenging accents, background noise, and packet loss, all while requiring fast responses, consistency across long multi-turn conversations, and reliable tool use. Performance also varies considerably by audio condition: in clean audio some models perform notably better, but realistic conditions continue to pose a challenge. Conversation duration also varies meaningfully across models, with implications for both customer experience and operational cost.

About τ-Voice:
Our Agentic Performance benchmark is based on τ-Voice (Ray, Dhandhania, Barres & Narasimhan, 2026), which extends τ2-bench into the voice modality to evaluate S2S models on realistic customer service tasks. It measures multi-turn instruction following, support of a simulated customer through a complete interaction, and tool use against simulated customer service systems. The simulated user combines an LLM-driven decision model with realistic audio synthesis: diverse accents, background noise, and packet loss modelled on real network conditions.
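To make the packet-loss condition concrete, here is a minimal sketch of how bursty loss could be applied to a stream of audio frames. The announcement does not say which loss model τ-Voice uses. The simplified two-state (Gilbert-style) model below, the function name `apply_burst_loss`, and its parameters are illustrative assumptions, not the benchmark's implementation.

```python
import random
from typing import List


def apply_burst_loss(
    frames: List[bytes],
    p_enter_bad: float,
    p_exit_bad: float,
    seed: int = 0,
) -> List[bytes]:
    """Zero out frames while a simulated link is in its 'bad' state.

    Hypothetical simplified Gilbert model: every frame is lost in the
    bad state and none is lost in the good state. Losses arrive in
    bursts because the state persists from one frame to the next.
    """
    rng = random.Random(seed)  # seeded so runs are reproducible
    in_bad_state = False
    output: List[bytes] = []
    for frame in frames:
        # Decide the link state for this frame before transmitting it.
        if in_bad_state:
            if rng.random() < p_exit_bad:
                in_bad_state = False
        elif rng.random() < p_enter_bad:
            in_bad_state = True
        # A lost frame is replaced by silence of the same length.
        output.append(bytes(len(frame)) if in_bad_state else frame)
    return output


# Usage: 20 ms frames of 16-bit mono audio at 16 kHz are 640 bytes each.
frames = [bytes([i % 256]) * 640 for i in range(1, 6)]
degraded = apply_burst_loss(frames, p_enter_bad=0.05, p_exit_bad=0.4)
```

Replacing lost frames with silence keeps the stream length unchanged, so downstream timing is unaffected; a real pipeline might instead apply packet-loss concealment.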

This complements our Big Bench Audio benchmark measuring intelligence and Conversational Dynamics (Full Duplex Bench subset) benchmark measuring conversational naturalness. Scores are the average of three independent pass@1 trials. We evaluate under realistic audio conditions using the τ2-bench base task split across three domains:
➤ Airline (50 scenarios): e.g., changing a flight, rebooking under policy constraints
➤ Retail (114 scenarios): e.g., disputing a charge, processing a return
➤ Telecom (114 scenarios): e.g., resolving a billing issue, troubleshooting a service problem
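The scoring rule can be sketched as follows. The scenario counts come from the list above; the function names are hypothetical, and the announcement does not state whether the headline score pools all scenarios or averages per-domain scores, so this sketch simply pools whatever scenarios it is given.

```python
from statistics import mean
from typing import Dict, List

# Scenario counts per domain, as listed in the announcement.
SCENARIOS_PER_DOMAIN: Dict[str, int] = {
    "airline": 50,
    "retail": 114,
    "telecom": 114,
}


def pass_at_1(trial: List[bool]) -> float:
    """Fraction of scenarios solved in one trial (one attempt each)."""
    return sum(trial) / len(trial)


def benchmark_score(trials: List[List[bool]]) -> float:
    """Average the pass@1 rate over independent trials."""
    return mean(pass_at_1(trial) for trial in trials)


# Usage: three trials over the same four illustrative scenarios.
trials = [
    [True, False, True, False],   # pass@1 = 0.50
    [True, True, False, False],   # pass@1 = 0.50
    [False, False, True, False],  # pass@1 = 0.25
]
score = benchmark_score(trials)  # (0.50 + 0.50 + 0.25) / 3
```

Averaging over trials reduces run-to-run variance without giving a model more than one attempt per scenario, which is what distinguishes it from pass@3.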

Task success is determined by deterministic checks against expected actions and final database state, consistent with the τ2-bench evaluator.
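As a rough illustration of what such a deterministic check looks like, the sketch below marks an episode successful only if every expected tool call was made and the final database matches the expected state. The `Episode` type, the `call` helper, and the tool names are hypothetical and are not taken from the τ2-bench codebase.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple

# A tool call is a name plus its arguments in a canonical, hashable form.
ToolCall = Tuple[str, Tuple[Tuple[str, Any], ...]]


def call(name: str, **args: Any) -> ToolCall:
    """Build a tool call with arguments sorted by key for stable comparison."""
    return (name, tuple(sorted(args.items())))


@dataclass
class Episode:
    tool_calls: List[ToolCall]  # calls the agent made during the conversation
    final_db: Dict[str, Any]    # database state after the conversation ended


def task_succeeded(
    episode: Episode,
    expected_actions: List[ToolCall],
    expected_db: Dict[str, Any],
) -> bool:
    """Pass only if all expected actions occurred and the final state matches."""
    actions_ok = all(action in episode.tool_calls for action in expected_actions)
    db_ok = episode.final_db == expected_db
    return actions_ok and db_ok


# Usage: a hypothetical rebooking scenario.
expected_actions = [call("update_flight", booking_id="B1", new_flight="XY200")]
expected_db = {"B1": {"flight": "XY200"}}
episode = Episode(
    tool_calls=[
        call("get_booking", booking_id="B1"),
        call("update_flight", new_flight="XY200", booking_id="B1"),
    ],
    final_db={"B1": {"flight": "XY200"}},
)
result = task_succeeded(episode, expected_actions, expected_db)
```

Because the check compares actions and state rather than the wording of the conversation, no LLM judge is needed and the outcome is reproducible.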

Key results:
xAI's Grok Voice Think Fast 1.0 is the clear leader at 52.1%, averaging 5.6 minutes per conversation, the second-longest overall. OpenAI's GPT-Realtime-2 (High) (39.8%, 3.0 min) and GPT-Realtime-1.5 (38.8%, 4.8 min) follow, with Gemini 3.1 Flash Live Preview - High close behind at 37.7% (3.8 min).

Speech to Speech is a fast-evolving modality, and we expect rankings to shift as we add new models with these capabilities and as model robustness improves.

Congratulations @xAI @elonmusk! See below for further detail ⬇️