# MuSViT: A Foundation Vision Model for Sheet Music Representation

- Source: HuggingFace Daily Papers (trending community papers)
- Published: 2026-06-30 08:00
- AIHOT score: 43
- AIHOT link: https://aihot.virxact.com/items/cmr1vhkjz02ewsl8z2ssiuymj
- Original paper: https://arxiv.org/abs/2606.31811

## AI Summary

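MuSViT is the first foundation vision model designed specifically for sheet music representation. It uses a ViT encoder pre-trained as a Masked Autoencoder on 9.7 million pages of scores from IMSLP, following a two-stage curriculum: a warm-up on synthetic typeset scores, then training on the full IMSLP corpus. Across four downstream tasks (full-page and staff-level music score recognition, music symbol detection, and difficulty classification), MuSViT consistently outperforms general-purpose vision encoders under linear probing (frozen encoder), and under fine-tuning it improves on task-specific state-of-the-art methods on most tasks. An embedding-transcription consistency analysis indicates that MuSViT encodes symbolic musical structure directly in its representation space, whereas the embeddings of other encoders do not correlate with score content.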
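
For readers unfamiliar with linear probing, the following is a minimal sketch of the idea, not the paper's evaluation code. The encoder is kept fixed and only a linear classifier on top of its features is trained. `frozen_encoder`, the toy two-dimensional "pages", and the hyperparameters are all hypothetical stand-ins; nothing here comes from MuSViT itself.

```python
import math


def frozen_encoder(page):
    """Hypothetical stand-in for a pretrained encoder.

    It is a fixed function: nothing inside it is updated during probing.
    """
    return [float(value) for value in page]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit binary logistic regression on frozen features with batch gradient descent."""
    dim = len(features[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss w.r.t. the logit
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        # Only the probe parameters change; the encoder is never touched.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, feature):
    """Return the predicted class (0 or 1) for one feature vector."""
    score = sum(w * xi for w, xi in zip(weights, feature)) + bias
    return 1 if score > 0 else 0


# Toy, linearly separable data standing in for two classes of pages.
pages = [(-1, -1), (-2, -1), (-1, -2), (1, 1), (2, 1), (1, 2)]
labels = [0, 0, 0, 1, 1, 1]
features = [frozen_encoder(p) for p in pages]
weights, bias = train_linear_probe(features, labels)
predictions = [predict(weights, bias, f) for f in features]
```

Because the probe is linear, its accuracy mostly reflects how well the frozen features already separate the classes, which is why this setting is used to compare encoders.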

## Full Text

Foundation models have transformed vision and language processing by providing rich, reusable representations that transfer across diverse tasks. Sheet music, as a visual encoding of musical language, lacks such a strong domain-specific backbone. We introduce MuSViT (Music Score Vision Transformer): the first foundation vision model for sheet music representation -- a ViT encoder pre-trained via Masked Autoencoders on 9.7 million pages from the IMSLP. To handle the complexity of real-world scores, we adopt a two-stage curriculum: a synthetic warm-up on typeset scores followed by large-scale training on the full IMSLP corpus. We evaluate MuSViT on four downstream tasks -- full-page and staff-level music score recognition, music symbol detection, and score difficulty classification -- under two scenarios: linear probing (frozen encoder) and fine-tuning. Under linear probing, MuSViT consistently outperforms modern vision encoders, revealing that general-purpose representations, regardless of scale, fall systematically short on the structured symbolic properties of musical notation. Under fine-tuning, MuSViT generally improves upon task-specific state-of-the-art methods. An additional embedding-transcription consistency analysis reveals that MuSViT encodes symbolic musical structure directly in its representation space -- unlike other encoders, whose embeddings do not correlate with music notation content. These results establish MuSViT as a foundation backbone for sheet music understanding.
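
As an illustration of the Masked Autoencoder pre-training mentioned above, here is a minimal sketch of the patch-masking step: the page is cut into patches, a random subset is hidden, and the encoder sees only the visible ones. This is a generic sketch rather than MuSViT's implementation. The 0.75 mask ratio is the default from the original MAE work and is an assumption here, as are the patch size and the helper names `patchify` and `random_patch_mask`; the abstract does not state these settings.

```python
import random


def patchify(image, patch):
    """Cut a 2D list (H x W) into flattened, non-overlapping patch x patch tiles."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    tiles = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tiles.append(
                [image[top + dy][left + dx]
                 for dy in range(patch)
                 for dx in range(patch)]
            )
    return tiles


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into (visible, masked) lists, MAE-style.

    mask_ratio=0.75 is an assumed value, not one reported for MuSViT.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_keep = max(1, round(num_patches * (1.0 - mask_ratio)))
    visible = sorted(indices[:num_keep])
    masked = sorted(indices[num_keep:])
    return visible, masked


# A toy 4 x 4 "page" with pixel values 0..15, split into 2 x 2 patches.
page = [[row * 4 + col for col in range(4)] for row in range(4)]
tiles = patchify(page, patch=2)
visible, masked = random_patch_mask(len(tiles), mask_ratio=0.75, seed=0)

# The encoder would receive only these tiles; a decoder would then be
# trained to reconstruct the pixels of the masked ones.
encoder_input = [tiles[i] for i in visible]
```

Hiding most of the page forces the model to infer missing notation from context, which is the intuition behind using this objective for structured content such as staves and symbols.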