# A Comprehensive Multi-Scenario Benchmark for Long-Form Speech Generation

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-27 08:00
- AIHOT score: 50
- AIHOT link: https://aihot.virxact.com/items/cmpupuvkx007xsl3tgwndtyb1
- Original link: https://arxiv.org/abs/2605.28618

## AI Summary

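SwanBench-Speech is a comprehensive benchmark for long-form speech generation. It covers long-form speech generation and dialog generation and targets acoustic, semantic, and expressiveness challenges. The benchmark contains 1,101 samples spanning 17 common speech scenarios, and it defines an automated evaluation protocol with seven metrics across those three dimensions. Experiments show that current models still struggle in highly expressive scenarios and fall noticeably short of real recordings in consistency and hierarchy.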

## Main Text

Recent advances in speech generation have enabled high-fidelity synthesis, yet systematic evaluation of models under long-context conditions remains largely underexplored. A comprehensive evaluation benchmark for long-form speech is indispensable for two reasons: 1) existing test scenarios are often confined to limited domains, leaving a significant gap with the diverse downstream applications; 2) existing metrics overlook critical long-form factors such as consistency and coherence, and so fail to generalize reliably. To this end, we propose SwanBench-Speech, a comprehensive benchmark that decomposes long-form speech quality into specific, disentangled dimensions. SwanBench-Speech has three key properties. 1) Rich speech scenarios: focusing on long-form speech generation and dialog generation, SwanBench-Speech covers acoustic, semantic, and expressiveness challenges, and consists of 1,101 samples spanning 17 common speech scenarios. 2) Comprehensive evaluation dimensions: along the acoustics, semantics, and expressiveness axes, SwanBench-Speech defines an automated evaluation protocol with seven metrics to provide a comprehensive, accurate, and standardized assessment. 3) Valuable insights: through extensive experiments, we reveal that current models still struggle in highly expressive scenarios and exhibit a notable gap in consistency and hierarchy compared to real recordings.
