AI Summary
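Artificial Analysis has released τ-Voice, its first agentic performance benchmark for speech-to-speech (S2S) models, which simulates complex customer service scenarios involving accents, background noise, and network packet loss. The tests show that the strongest current S2S models resolve only about half of real-world tasks end to end, leaving a gap to top text-based agents. xAI's Grok Voice Think Fast 1.0 leads with a 52.1% success rate and an average conversation length of 5.6 minutes; OpenAI's GPT-Realtime series and Google's Gemini follow close behind. The field is moving quickly, and the rankings may shift as models are updated.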
Grok Voice is #1!
Announcing agentic performance benchmarking for Speech to Speech models on Artificial Analysis. We use τ-Voice to measure tool calling and customer interaction ...