Beyond Text Dominance: Understanding Modality Preference in Omni-modal Large Language Models

2026-04-18 08:00

AI Summary

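The research team studies modality preference in native omni-modal large language models (OLLMs). They construct a conflict-based benchmark, propose a modality selection rate metric, and systematically evaluate 10 representative models. Unlike the "text dominance" of traditional vision-language models, most OLLMs show a pronounced visual preference, and layer-wise probing confirms that this preference emerges gradually in the mid-to-late layers rather than being static. Building on this mechanism, the team uses internal signals to diagnose cross-modal hallucinations, achieving competitive performance on three multimodal benchmarks without task-specific data.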
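To make the metric concrete, here is a minimal sketch of how a modality selection rate could be computed on conflict samples. It assumes the rate is the fraction of samples whose prediction matches the answer supported by exactly one modality; the summary does not give the paper's exact definition, and the function name, record layout, and toy records below are hypothetical.

```python
from collections import Counter


def modality_selection_rate(records):
    """Return the share of conflict samples attributed to each modality.

    Each record is a dict with (hypothetical layout):
      "prediction": the model's answer
      "answers":    mapping from modality name to the answer that modality supports
    A prediction that matches no modality, or more than one, is counted as "other".
    """
    counts = Counter()
    total = 0
    for rec in records:
        total += 1
        matched = [m for m, ans in rec["answers"].items() if ans == rec["prediction"]]
        counts[matched[0] if len(matched) == 1 else "other"] += 1
    if total == 0:
        return {}
    return {modality: n / total for modality, n in counts.items()}


# Toy conflict samples (made up for illustration, not taken from the benchmark).
records = [
    {"prediction": "cat", "answers": {"text": "dog", "vision": "cat", "audio": "bird"}},
    {"prediction": "red", "answers": {"text": "blue", "vision": "red", "audio": "green"}},
    {"prediction": "two", "answers": {"text": "two", "vision": "three", "audio": "four"}},
    {"prediction": "yes", "answers": {"text": "no", "vision": "maybe", "audio": "never"}},
]
rates = modality_selection_rate(records)
print(rates)
```

On these toy records the model follows vision in two of four cases, text in one, and neither in one, so a visual preference would show up as the largest rate under the "vision" key.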

Original text · not translated

Native Omni-modal Large Language Models (OLLMs) have shifted from pipeline architectures to unified representation spaces. However, this native integration gives rise to a critical yet underexplored phenomenon: modality preference. To bridge this gap, we first systematically quantify modality preference of OLLMs using a newly-curated conflict-based benchmark and the modality selection rate metric. Our evaluation of ten representative OLLMs reveals a notable paradigm shift: unlike the "text-dominance" of traditional VLMs, most OLLMs exhibit a pronounced visual preference. To further understand the underlying mechanism, we conduct layer-wise probing and demonstrate that such modality preference is not static but emerges progressively in the mid-to-late layers. Building upon these insights, we leverage these internal signals to diagnose cross-modal hallucinations, achieving competitive performance across three downstream multi-modal benchmarks without task-specific data. Our work provides both a mechanistic understanding and a practical tool for building more trustworthy OLLMs. Our code and related resources are publicly available at: https://github.com/icip-cas/OmniPreference
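The idea of layer-wise probing can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say which probe they use, so a nearest-centroid classifier stands in for it, and the "hidden states" are synthetic vectors whose class separation is set by hand to grow with depth. All names and numbers are assumptions made for illustration.

```python
import random


def make_layer_data(separation, rng, n=200, dim=8):
    """Synthetic stand-in for one layer's hidden states.

    Label 1 = the model ended up following vision, label 0 = it followed text.
    `separation` controls how far apart the two classes sit at this layer.
    """
    data = []
    for i in range(n):
        label = i % 2
        sign = 1.0 if label == 1 else -1.0
        vec = [sign * separation + rng.gauss(0.0, 1.0) for _ in range(dim)]
        data.append((vec, label))
    return data


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(len(vectors[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def probe_accuracy(data, train_fraction=0.5):
    """Fit a nearest-centroid probe on the first part, score it on the rest."""
    split = int(len(data) * train_fraction)
    train, test = data[:split], data[split:]
    centroids = {
        label: centroid([v for v, y in train if y == label]) for label in (0, 1)
    }
    correct = 0
    for vec, label in test:
        predicted = min(centroids, key=lambda c: sq_dist(vec, centroids[c]))
        correct += predicted == label
    return correct / len(test)


rng = random.Random(0)
# Hand-picked separations: no signal in early layers, strong signal in late ones.
separations = [0.0, 0.0, 0.5, 1.0, 2.0, 2.0]
layer_accuracy = [probe_accuracy(make_layer_data(s, rng)) for s in separations]
for layer, acc in enumerate(layer_accuracy):
    print(f"layer {layer}: probe accuracy = {acc:.2f}")
```

With real models, the per-layer vectors would come from the model's hidden states on conflict samples. A probe accuracy that sits near chance in early layers and rises in later ones is the kind of pattern the abstract describes as preference emerging progressively.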

HuggingFace Daily Papers (community trending papers)

Read the original · arxiv.org
