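TTS evaluation is fundamentally flawed. Today's mainstream evaluation standards are badly out of step with what users prefer in real conversations, and the technology has iterated faster than its benchmarks have developed. Systems built for real-time conversational agents should be evaluated in real interaction, not on isolated audio clips. The core problem is that existing methods reduce "naturalness" to a single metric that can be averaged and ranked, ignoring the details that matter in how people perceive speech: subtle timing variations, restrained emotional expression, uneven breathing rhythm, and phrasing that fits the context.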
TTS evals are broken because the scores the field trusts do not match what people actually prefer in real conversations.
I think this is a solid critique because TTS has clearly improved faster than its benchmarks, and a system built for live agents should be judged inside live interaction, not on isolated clips.
The failure is not that speech models sound bad. It is that evaluation still treats naturalness like a single trait that can be averaged, ranked, and optimized.
That misses what listeners actually hear. A voice feels human through tiny timing shifts, restrained emotion, uneven breath, and phrasing that fits the moment rather than performs at every moment.