# MERIT: Disentangled Music Representations for Audio Similarity Learning

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-26 08:00
- AIHOT score: 39
- AIHOT link: https://aihot.virxact.com/items/cmpxkya4604y5slckpljvdxlo
- Original link: https://arxiv.org/abs/2605.27346

## AI Summary

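MERIT is a framework for learning disentangled music representations. It addresses a limitation of current music similarity models, which compute a single aggregate score that mixes distinct dimensions such as melody, rhythm, and timbre. The framework produces a dedicated representation for each of these three core dimensions. To overcome the lack of single-dimension variation in real audio, MERIT adopts a novel training strategy that combines conditional audio generation with source-separated stems to encourage single-factor variation in the training data. The evaluation shows strong factor-wise disentanglement: each representation head responds strongly to its target perceptual dimension while performing near chance on the others, and this property holds both in the synthetic training domain and on independent real-world audio.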

## Body

Current music similarity models typically compute a single, monolithic score, entangling distinct musical dimensions such as melody, rhythm, and timbre. This limits user control and interpretability and rules out nuanced, dimension-specific queries. We introduce MERIT, a framework for learning disentangled, factor-specific music representations tailored to these three core dimensions. To overcome the lack of isolated musical variations in real-world audio, we adopt a novel training strategy that combines conditional audio generation with source-separated stems to strongly encourage single-factor variation in the training data. Our evaluations demonstrate strong factor-wise disentanglement: each head responds strongly to its intended perceptual dimension while remaining near chance on the others, a representational property that holds across both the synthetic training domain and independent real-world audio.
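
To make the idea of factor-specific heads concrete, here is a minimal sketch of how per-factor similarity and a simple disentanglement check could look. It is not the paper's implementation: the `Embedding` layout, the function names, the use of cosine similarity, and the triplet-style check are all assumptions made for illustration, and the toy vectors are hand-written rather than model outputs.

```python
import math
from typing import Dict, List, Sequence, Tuple

FACTORS = ("melody", "rhythm", "timbre")

# Hypothetical layout: one vector per factor head for each audio clip.
Embedding = Dict[str, Sequence[float]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def factor_similarity(a: Embedding, b: Embedding) -> Dict[str, float]:
    """One similarity score per factor instead of a single monolithic score."""
    return {f: cosine(a[f], b[f]) for f in FACTORS}


def triplet_accuracy(
    head: str, triplets: List[Tuple[Embedding, Embedding, Embedding]]
) -> float:
    """Fraction of (anchor, positive, negative) triplets where `head` ranks
    the positive closer to the anchor than the negative. With random
    embeddings this would be about 0.5 (chance)."""
    hits = sum(
        cosine(a[head], p[head]) > cosine(a[head], n[head])
        for a, p, n in triplets
    )
    return hits / len(triplets)


# Toy clips (hand-written vectors, not real model outputs).
query = {"melody": [1, 0], "rhythm": [0, 1], "timbre": [1, 1]}
same_melody = {"melody": [1, 0], "rhythm": [1, 0], "timbre": [1, -1]}
other = {"melody": [0, 1], "rhythm": [0, 1], "timbre": [1, 1]}

scores = factor_similarity(query, same_melody)
melody_acc = triplet_accuracy("melody", [(query, same_melody, other)])
```

In this toy setup, `scores` is high only for `melody`, which is the kind of query a single aggregate score cannot express. A disentangled model in the sense described above would score well on `triplet_accuracy` when the head matches the factor shared by the positive, and close to 0.5 for the other heads over a large set of triplets.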