# StepFun's StepAudio 2.5 TTS ranks third in the Speech Arena: quality improves, but pricing is on the high side

- Source: Artificial Analysis (@ArtificialAnlys)
- Published: 2026-05-09 08:26
- AIHOT score: 62
- AIHOT link: https://aihot.virxact.com/items/cmoxmofy202xhsllh0ouzkfpu
- Original link: https://x.com/ArtificialAnlys/status/2052908086919397419

## AI Summary

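StepFun's StepAudio 2.5 TTS model ranks third on the Artificial Analysis Speech Arena Leaderboard, behind only Inworld Realtime TTS 1.5 Max and Google Gemini 3.1 Flash TTS. Speech naturalness has improved markedly, and with an Elo score of 1,187 the model overtakes Eleven v3. It is priced at $85 per million characters, higher than the leading models; its generation speed of 37.6 characters per second falls between those of its competitors. The model offers two ways to control speech delivery: a global context prompt and inline emotion tags.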

## Body

StepFun's new StepAudio 2.5 TTS ranks #3 on the Artificial Analysis Speech Arena Leaderboard, behind only Inworld's Realtime TTS 1.5 Max and Google's Gemini 3.1 Flash TTS.

StepAudio 2.5 TTS represents a significant step forward for StepFun from previous TTS models, with notably increased naturalness of speech samples. The model now edges out Eleven v3 on our current prompt set with an Elo score of 1,187.
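As a rough illustration of how narrow these Elo gaps are, the sketch below applies the textbook Elo expected-score formula (base 10, scale 400) to the ratings reported in this post. Whether the Speech Arena computes ratings with exactly this formula is an assumption, and the function name `expected_win_rate` is hypothetical.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of head-to-head wins for A over B under the standard Elo model.

    Assumption: base-10 logistic curve with a 400-point scale, as in classic Elo.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Ratings as reported in the post.
STEPAUDIO_2_5 = 1187
ELEVEN_V3 = 1179
INWORLD_1_5_MAX = 1215

step_vs_eleven = expected_win_rate(STEPAUDIO_2_5, ELEVEN_V3)         # ≈ 0.512
step_vs_inworld = expected_win_rate(STEPAUDIO_2_5, INWORLD_1_5_MAX)  # ≈ 0.460

print(f"StepAudio 2.5 vs Eleven v3:       {step_vs_eleven:.3f}")
print(f"StepAudio 2.5 vs Inworld 1.5 Max: {step_vs_inworld:.3f}")
```

Under that assumption, an 8-point lead corresponds to winning only slightly more than half of direct comparisons, so "edges out" is an apt description.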

Key takeaways:
➤ Quality: StepAudio 2.5 TTS has an Elo of 1,187 based on 834 arena appearances, placing it 28 points behind the leading model (Inworld TTS 1.5 Max at 1,215) and 8 points ahead of Eleven v3 at 1,179
➤ Pricing: The model is priced at $85/1M characters, a premium over the leading frontier models: Inworld TTS 1.5 Max at $35/1M and Gemini 3.1 Flash TTS at $36.6/1M
➤ Speed: The model generates 37.6 characters per second, compared to 220.5 chars/s for Inworld TTS 1.5 Max and 30.1 chars/s for Gemini 3.1 Flash TTS
➤ Prompting: StepAudio 2.5 TTS offers two ways to control speech delivery: (1) a global context prompt for overall style, and (2) inline contextual tags for more granular emotion and prosody
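To make the pricing and speed figures concrete, here is a minimal sketch that estimates cost and generation time for a fixed amount of text. It assumes pricing is strictly linear in character count and that the reported characters-per-second figure holds as constant throughput; the `TtsModel` class and the 100,000-character workload are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TtsModel:
    name: str
    usd_per_million_chars: float  # list price reported in the post
    chars_per_second: float       # generation speed reported in the post

    def cost_usd(self, n_chars: int) -> float:
        # Assumes linear pricing with no minimums, tiers, or discounts.
        return n_chars / 1_000_000 * self.usd_per_million_chars

    def seconds(self, n_chars: int) -> float:
        # Assumes constant throughput on a single request stream.
        return n_chars / self.chars_per_second


MODELS = [
    TtsModel("StepAudio 2.5 TTS", 85.0, 37.6),
    TtsModel("Inworld TTS 1.5 Max", 35.0, 220.5),
    TtsModel("Gemini 3.1 Flash TTS", 36.6, 30.1),
]

N_CHARS = 100_000  # hypothetical workload

for m in MODELS:
    minutes = m.seconds(N_CHARS) / 60
    print(f"{m.name:22s} ${m.cost_usd(N_CHARS):6.2f}  {minutes:6.1f} min")
```

For this workload the sketch gives about $8.50 for StepAudio 2.5 TTS against $3.50 and $3.66 for the two higher-ranked models, roughly a 2.3x to 2.4x price premium, with generation time falling between the two.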
