MuSViT: A Foundation Vision Model for Sheet Music Representation
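Source: arxiv.org
MuSViT is the first foundation vision model designed specifically for sheet music representation. It uses a ViT encoder pre-trained with Masked Autoencoders on 9.7 million pages of scores from IMSLP, following a two-stage curriculum (synthetic typeset scores first, then the full IMSLP corpus). Across four downstream tasks (full-page and staff-level music score recognition, music symbol detection, and difficulty classification), MuSViT consistently outperforms general-purpose vision encoders under linear probing (frozen encoder), while fine-tuning improves on task-specific state-of-the-art methods on most tasks. An embedding-transcription consistency analysis shows that MuSViT encodes symbolic musical structure directly in its representation space, whereas the embeddings of other encoders do not correlate with score content.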
Foundation models have transformed vision and language processing by providing rich, reusable representations that transfer across diverse tasks. Sheet music, as a visual encoding of musical language, lacks such a strong domain-specific backbone. We introduce MuSViT (Music Score Vision Transformer): the first foundation vision model for sheet music representation -- a ViT encoder pre-trained via Masked Autoencoders on 9.7 million pages from the IMSLP. To handle the complexity of real-world scores, we adopt a two-stage curriculum: a synthetic warm-up on typeset scores followed by large-scale training on the full IMSLP corpus. We evaluate MuSViT on four downstream tasks -- full-page and staff-level music score recognition, music symbol detection, and score difficulty classification -- under two scenarios: linear probing (frozen encoder) and fine-tuning. Under linear probing, MuSViT consistently outperforms modern vision encoders, revealing that general-purpose representations, regardless of scale, fall systematically short on the structured symbolic properties of musical notation. Under fine-tuning, MuSViT generally improves upon task-specific state-of-the-art methods. An additional embedding-transcription consistency analysis reveals that MuSViT encodes symbolic musical structure directly in its representation space -- unlike other encoders, whose embeddings do not correlate with music notation content. These results establish MuSViT as a foundation backbone for sheet music understanding.
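To make the pre-training objective more concrete, the following is a minimal sketch of the patch-masking step used in Masked Autoencoder training in general. It is not taken from the MuSViT paper: the 224x224 input size, the 16-pixel patch size, and the 0.75 mask ratio are assumptions chosen for illustration (the abstract does not state the values used), the random array merely stands in for a rendered score page, and `patchify` and `random_masking` are hypothetical helper names.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W) grayscale page into flattened, non-overlapping patches."""
    h, w = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch) -> (gh, gw, patch, patch) -> (gh * gw, patch * patch)
    x = image.reshape(gh, patch, gw, patch).transpose(0, 2, 1, 3)
    return x.reshape(gh * gw, patch * patch)

def random_masking(patches, mask_ratio, rng):
    """Keep a random subset of patches; return them, their indices, and a mask."""
    n = patches.shape[0]
    n_keep = int(n * (1.0 - mask_ratio))
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])
    mask = np.ones(n, dtype=bool)  # True = masked, i.e. to be reconstructed
    mask[keep_idx] = False
    return patches[keep_idx], keep_idx, mask

rng = np.random.default_rng(0)
page = rng.random((224, 224))        # assumed stand-in for a score page
patches = patchify(page, patch=16)   # assumed patch size: 14 * 14 = 196 patches
visible, keep_idx, mask = random_masking(patches, mask_ratio=0.75, rng=rng)
```

In a Masked Autoencoder setup, only the `visible` patches would be passed to the ViT encoder, and a lightweight decoder would be trained to reconstruct the pixels of the patches where `mask` is true. After pre-training, the decoder is typically discarded and the encoder is reused, either frozen (linear probing) or fine-tuned.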