# StepAudio 2.5 TTS Ranks in the Top Three on the Speech Synthesis Leaderboard

- Source: Artificial Analysis (@ArtificialAnlys)
- Published: 2026-05-09 07:56
- AIHOT score: 67
- AIHOT link: https://aihot.virxact.com/items/cmoxlluq202n1sllh599hi4qh
- Original link: https://x.com/ArtificialAnlys/status/2052900578502885645

## AI Summary

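StepFun's StepAudio 2.5 TTS ranks third on the Artificial Analysis Speech Arena Leaderboard, behind only Inworld Realtime TTS 1.5 Max and Google Gemini 3.1 Flash TTS. The model has an Elo score of 1,187 and now surpasses Eleven v3 on the current prompt set, with markedly more natural speech. It is priced at $85 per million characters, higher than the leading competitors, and generates 37.6 characters per second. It offers two control paths, a global context prompt and inline emotion tags, for fine-grained adjustment of speech style and prosody.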

## Full Text

StepFun's new StepAudio 2.5 TTS ranks #3 on the Artificial Analysis Speech Arena Leaderboard, behind only Inworld's Realtime TTS 1.5 Max and Google's Gemini 3.1 Flash TTS.

StepAudio 2.5 TTS represents a significant step forward for StepFun from its previous TTS models, with notably more natural speech samples. The model now edges out Eleven v3 on our current prompt set with an Elo score of 1,187.
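To give a feel for what these Elo gaps mean, the sketch below converts a rating difference into an expected head-to-head preference rate. It assumes the conventional logistic Elo formula with a 400-point scale; the post does not say which Elo variant or scale the arena uses, so the `expected_score` helper and its `scale` default are illustrative assumptions. The ratings themselves are the ones reported in the key takeaways.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of head-to-head wins for A over B.

    Uses the conventional logistic Elo formula. The 400-point scale is an
    assumption; the arena's exact rating method is not stated in the post.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Ratings as reported in the post.
STEP_AUDIO_2_5 = 1187
ELEVEN_V3 = 1179
INWORLD_TTS_1_5_MAX = 1215

p_vs_eleven = expected_score(STEP_AUDIO_2_5, ELEVEN_V3)
p_vs_inworld = expected_score(STEP_AUDIO_2_5, INWORLD_TTS_1_5_MAX)

print(f"StepAudio 2.5 TTS vs Eleven v3:           {p_vs_eleven:.3f}")
print(f"StepAudio 2.5 TTS vs Inworld TTS 1.5 Max: {p_vs_inworld:.3f}")
```

Under that assumption, an 8-point lead corresponds to being preferred in roughly 51% of direct comparisons, and a 28-point deficit to roughly 46%, so the top of the leaderboard is closely bunched.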

Key takeaways:
➤ Quality: StepAudio 2.5 TTS has an Elo of 1,187 based on 834 arena appearances, placing it 28 points behind the leading model (Inworld TTS 1.5 Max at 1,215) and 8 points ahead of Eleven v3 at 1,179
➤ Pricing: The model is priced at $85/1M characters, a premium over the leading frontier models: Inworld TTS 1.5 Max at $35/1M and Gemini 3.1 Flash TTS at $36.6/1M
➤ Speed: The model generates 37.6 characters per second, compared to 220.5 chars/s for Inworld TTS 1.5 Max and 30.1 chars/s for Gemini 3.1 Flash TTS
➤ Prompting: StepAudio 2.5 TTS offers two ways to control speech delivery: (1) a global context prompt for overall style, and (2) inline contextual tags for more granular emotion and prosody
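The pricing and speed figures above can be turned into a rough per-job estimate. The sketch below is a back-of-the-envelope calculation only: the `TTSModel` class and the 100,000-character workload are hypothetical, and it assumes cost and generation time both scale linearly with character count, which the post does not claim.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TTSModel:
    """Hypothetical container for the price and speed figures quoted in the post."""
    name: str
    usd_per_million_chars: float
    chars_per_second: float

    def cost_usd(self, n_chars: int) -> float:
        # Assumes purely linear per-character pricing.
        return n_chars / 1_000_000 * self.usd_per_million_chars

    def generation_seconds(self, n_chars: int) -> float:
        # Assumes the quoted throughput holds for the whole job.
        return n_chars / self.chars_per_second


MODELS = [
    TTSModel("StepAudio 2.5 TTS", 85.0, 37.6),
    TTSModel("Inworld TTS 1.5 Max", 35.0, 220.5),
    TTSModel("Gemini 3.1 Flash TTS", 36.6, 30.1),
]

N_CHARS = 100_000  # illustrative workload, not a figure from the post

for model in MODELS:
    cost = model.cost_usd(N_CHARS)
    minutes = model.generation_seconds(N_CHARS) / 60
    print(f"{model.name:<22} ${cost:>5.2f}  ~{minutes:.1f} min")
```

On these assumptions, 100,000 characters would cost $8.50 with StepAudio 2.5 TTS versus $3.50 with Inworld TTS 1.5 Max and $3.66 with Gemini 3.1 Flash TTS.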

See more details and listen to samples below ⬇️