Beyond Text Dominance: Understanding Modality Preference in Omni-modal Large Language Models
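The research team studies modality preference in native omni-modal large language models (OLLMs). They build a conflict-based benchmark, propose the modality selection rate metric, and systematically evaluate ten representative models. In contrast to the "text dominance" of traditional vision-language models, most OLLMs show a pronounced visual preference, and layer-wise probing confirms that this preference emerges gradually in the mid-to-late layers rather than being static. Building on this mechanism, the team uses internal signals to diagnose cross-modal hallucinations, achieving competitive performance on three multimodal benchmarks without task-specific data.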
Native Omni-modal Large Language Models (OLLMs) have shifted from pipeline architectures to unified representation spaces. However, this native integration gives rise to a critical yet underexplored phenomenon: modality preference. To address this, we first systematically quantify the modality preference of OLLMs using a newly curated conflict-based benchmark and the modality selection rate metric. Our evaluation of ten representative OLLMs reveals a notable paradigm shift: unlike the "text-dominance" of traditional VLMs, most OLLMs exhibit a pronounced visual preference. To further understand the underlying mechanism, we conduct layer-wise probing and demonstrate that such modality preference is not static but emerges progressively in the mid-to-late layers. Building upon these insights, we leverage these internal signals to diagnose cross-modal hallucinations, achieving competitive performance across three downstream multi-modal benchmarks without task-specific data. Our work provides both a mechanistic understanding and a practical tool for building more trustworthy OLLMs. Our code and related resources are publicly available at: https://github.com/icip-cas/OmniPreference
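The abstract names the modality selection rate but does not define it. As a rough illustration only, the sketch below assumes the metric is the fraction of conflict samples on which the model's answer agrees with the answer implied by one particular modality. The function name `modality_selection_rate`, the record layout, and the `"other"` bucket are hypothetical and are not taken from the paper or its repository.

```python
from collections import Counter


def modality_selection_rate(records):
    """Return the share of conflict samples whose prediction follows each modality.

    Assumed (hypothetical) record layout:
      {"answers": {"text": ..., "vision": ..., "audio": ...}, "prediction": ...}
    where "answers" maps each modality to the answer that modality alone supports,
    and the modalities deliberately disagree with one another.
    """
    counts = Counter()
    total = 0
    for rec in records:
        total += 1
        # Modalities whose implied answer equals the model's prediction.
        matched = [m for m, ans in rec["answers"].items() if ans == rec["prediction"]]
        if len(matched) == 1:
            counts[matched[0]] += 1
        else:
            # No match, or an ambiguous match: attribute to no modality.
            counts["other"] += 1
    if total == 0:
        return {}
    return {name: count / total for name, count in counts.items()}


# Toy conflict samples (fabricated for illustration, not benchmark data).
conflict = {"text": "cat", "vision": "dog", "audio": "bird"}
records = [
    {"answers": conflict, "prediction": "dog"},
    {"answers": conflict, "prediction": "dog"},
    {"answers": conflict, "prediction": "cat"},
    {"answers": conflict, "prediction": "fish"},
]
rates = modality_selection_rate(records)
print(rates)
```

On this toy input the model follows the visual answer on two of four samples, so the visual rate is 0.5. A higher visual rate than textual rate would correspond to the visual preference the abstract reports, under the assumed definition.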