# MMAE: A Massive Multitask Audio Editing Benchmark

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-05 08:00
- AIHOT score: 56
- AIHOT link: https://aihot.virxact.com/items/cmq4slb4n01bsslt29nbs6x92
- Original paper: https://arxiv.org/abs/2606.07229

## AI Summary

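MMAE is the first comprehensive evaluation benchmark designed for general-purpose instruction-based audio editing. It covers 7 audio modalities spanning sound, speech, music, and their mixtures, and defines a taxonomy with 6 levels of task complexity, 2 levels of granularity, and 8 operation types. Its 2,000 high-fidelity samples were curated through human-agent collaboration and are paired with a rubric-based evaluation framework that decomposes free-form tasks into 17,741 verifiable criteria, enabling precise, multi-dimensional assessment of instruction following and context consistency. Evaluation of leading models shows that the Exact Match Rate (EMR) stays below 5% overall and drops to 0% on complex mixed-modality tasks.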

## Full Text

We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation, interactive editing has rapidly expanded from visual domains, pioneered by models like Nano-banana 2 for images and Gemini-Omni for video, into audio. However, the current evaluation infrastructure lags severely, remaining highly fragmented and restricted to specific subdomains or basic operations. Unlike existing benchmarks that are limited in scope, MMAE extends to a broad spectrum of real-world scenarios, encompassing 7 distinct audio modalities, including sound, speech, music, and their mixtures. Furthermore, we establish a comprehensive taxonomy spanning 6 levels of task complexity, from basic modifications to multi-hop reasoning and multi-round editing, 2 levels of granularity, and 8 distinct operation types. Meticulously curated through human-agent collaboration, MMAE comprises 2,000 high-fidelity samples paired with a pioneering rubric-based evaluation framework. By decomposing free-form tasks into 17,741 verifiable criteria, this robust rubric-based paradigm enables a precise, multi-dimensional assessment of both instruction following and context consistency. Our extensive evaluation of leading models reveals that current systems remain far from achieving reliable edits. Strikingly, the Exact Match Rate (EMR) consistently falls below 5% and plummets to an absolute 0% in complex, mixed-modality tasks, exposing critical bottlenecks in precise execution and structural robustness. We hope MMAE will serve as a catalyst for future advances in the intelligent creation community, providing a clear diagnostic roadmap and establishing a standardized, long-lasting evaluation paradigm for next-generation audio editing systems.
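The abstract does not spell out how the rubric verdicts are aggregated, so the following is only a minimal sketch of one plausible reading: each sample carries a list of pass/fail criteria, and a sample counts toward the Exact Match Rate only when every one of its criteria passes. The `Criterion` structure, the dimension labels, and both function names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One verifiable rubric item (hypothetical structure, not from the paper)."""
    dimension: str  # e.g. "instruction_following" or "context_consistency"
    passed: bool    # verdict from some judge; how verdicts are produced is out of scope


def criterion_pass_rate(samples: list[list[Criterion]]) -> float:
    """Fraction of all criteria, pooled across samples, that passed."""
    total = sum(len(sample) for sample in samples)
    if total == 0:
        return 0.0
    passed = sum(c.passed for sample in samples for c in sample)
    return passed / total


def exact_match_rate(samples: list[list[Criterion]]) -> float:
    """Fraction of samples whose criteria ALL passed (assumed EMR definition)."""
    if not samples:
        return 0.0
    # A sample with no criteria is not counted as an exact match.
    exact = sum(1 for sample in samples if sample and all(c.passed for c in sample))
    return exact / len(samples)


if __name__ == "__main__":
    demo = [
        [Criterion("instruction_following", True), Criterion("context_consistency", True)],
        [Criterion("instruction_following", True), Criterion("context_consistency", False)],
    ]
    print(criterion_pass_rate(demo), exact_match_rate(demo))
```

Under this assumed definition, a single failed criterion disqualifies the whole sample. With 17,741 criteria over 2,000 samples (about 8.9 per sample on average), that is one way a strict all-or-nothing metric can sit far below the per-criterion pass rate.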