# 基于LLM的多模态音乐推荐系统

- 来源：HuggingFace Daily Papers（社区热门论文）
- 发布时间：2026-05-28 08:00
- AIHOT 分数：44
- AIHOT 链接：https://aihot.virxact.com/items/cmq0kabae0704sltr0c1aywkn
- 原文链接：https://arxiv.org/abs/2606.00125

## AI 摘要

研究提出一个基于LLM的多模态音乐推荐框架，在LastFM-1K数据集上融合三类信号：预训练模型提取的音频与歌词嵌入、使用MGPHot标注框架生成的LLM语义元数据、以及听歌完成率。该框架基于E4SRec扩展，集成SASRec、BERT4Rec、GRU4Rec等编码器，并引入LLaMa-2-13B、Qwen2.5-7B-Instruct和LLaMa-3-70B进行零样本与微调实验。相比仅使用歌曲ID的基线，内容特征融合使Recall最高提升95%、NDCG提升79%。研究还发现，简单拼接多模态特征并不总能带来叠加提升，并开放了一个大规模音乐推荐多模态基准。

## 正文

Music recommendation systems typically treat songs as opaque tokens, relying on collaborative interaction histories which overlooks semantic or acoustic content. Prior work has explored LLM-augmented, multimodal, and text-enhanced approaches to sequential recommendation, and while some methods partially combine semantic, acoustic, or engagement signals, none jointly model all three within a unified LLM-based sequential reasoning framework that grounds recommendations in actual song content. In this work, we propose a multimodal framework for session-based music recommendation that enriches the LastFM-1K dataset with three complementary signals: (1) audio and lyric embeddings extracted using pretrained music and text representation models, (2) LLM-generated semantic metadata using the MGPHot annotation schema, and (3) listening completion ratios. We adopt the E4SRec framework by extending it with multimodal features and different item ID encoder backbones, including SASRec, BERT4Rec, and GRU4Rec. We further extend the LLM backbone option with LLaMa-2-13B, Qwen2.5-7B-Instruct, and LLaMa-3-70B in both zero-shot and fine-tuned settings. Our experiments show that integrating content-based features improves over ID-only baselines up to 95% in terms of Recall and 79% in terms of NDCG. Moreover, our experiments show that naive multimodal fusion does not always yield additive improvements, highlighting challenges in cross-modal integration. We release a large-scale multimodal benchmark for music recommendation.