TuneJury:开放的音乐生成偏好对齐奖励模型
阅读原文· arxiv.orgTuneJury 是一个开放的实例级成对奖励模型,从文本提示和音频片段预测音乐偏好分数。其检查点基于公开的人类偏好标签训练,涵盖竞技场风格 A vs B 投票、度量对齐偏好对、众包成对比较和专家美学评级。预测分数差距在 held-out 测试集上校准良好,支持通过简单阈值过滤数据。TuneJury 可泛化到分布外基准,优于先前基线。引入 anchor calibration(事后、每系统的 Bradley-Terry 校准),以比从头再训练更高的数据效率恢复一致性。相同冻结奖励在 best-of-N 选择、DITTO 风格潜在优化和专家迭代后训练三个下游应用中驱动一致奖励轴增益。
We introduce TuneJury, an open, instance-level pairwise reward model for text-to-music that predicts a music preference score from a text prompt and an audio clip. The released checkpoint is trained on publicly available human-preference labels covering arena-style (A vs. B) votes, metric-alignment preference pairs, crowdsourced pairwise comparisons, and expert aesthetic ratings. The predicted score margin between two clips is well calibrated on our held-out test split, supporting data filtering via a simple score threshold. TuneJury generalizes to both held-out test pairs and out-of-distribution benchmarks, remaining competitive with prior baselines on the latter. For generators released after training, we introduce anchor calibration, a post-hoc, per-system Bradley-Terry calibration that recovers agreement at substantially better data efficiency than from-scratch retraining. The same frozen reward drives consistent reward-axis gains across three downstream applications: inference-time best-of-N selection, DITTO-style latent optimization, and expert-iteration post-training. TuneJury is available at https://github.com/yonghyunk1m/TuneJury.