Local Modality Substitution: Toward More Deeply Fused Vision-Language Models
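Source: arxiv.org
Existing vision-language models suffer from "carrier sensitivity": performance drops significantly when a textual question is replaced with a semantically equivalent image, because text and images play asymmetric roles in the training data. To address this, the researchers propose LoMo, a lightweight, architecture-agnostic data curation paradigm that dynamically restructures single-modality prompts into interleaved "text, image, text" multimodal sequences, providing a supervision signal for cross-modal representational invariance. Experiments on 13 multimodal benchmarks show that LoMo effectively improves multimodal reasoning: compared with standard SFT, LLaVA-OneVision-1.5-8B gains 2.67 points and Qwen3.5-9B gains 2.82 points.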
Vision-Language Models (VLMs) have achieved substantial progress across a wide range of understanding and reasoning tasks, driven by large-scale image-text training aimed at multimodal fusion. Ideally, replacing a textual question with its rendered-image counterpart should leave model performance essentially unaffected. In practice, however, such modality substitution induces dramatic performance degradation. We attribute this "carrier sensitivity" issue to an inherent bias in current training corpora. Across prevalent datasets such as image captioning, VQA, OCR, and web-sourced interleaved data, text and images are typically organized into distinct and asymmetric roles, with text serving as linguistic queries and images as visual references. Such data bias leads VLMs to exhibit distinct preferences for information acquisition across different modalities. Consequently, VLMs fail to align representations of semantically equivalent content across textual and visual carriers, making model reasoning fragile under modality substitution. To address this, we propose Local Modality Substitution (LoMo), a lightweight, architecture-agnostic data curation paradigm designed to provide supervision for cross-modal representational invariance between semantically equivalent text and image carriers. LoMo achieves this by reformulating single-modality prompts into seamlessly interleaved multimodal sequences. It dynamically selects target text spans and recasts them as rendered images, thereby preserving the same semantics across "text, visual, text" carriers. Extensive experiments across 13 diverse multimodal benchmarks demonstrate that LoMo significantly improves overall multimodal reasoning and yields deeper cross-modal fusion. Specifically, it delivers consistent gains across foundational models, improving over standard SFT by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B.
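The restructuring step described above, in which a text span is selected and recast as an image inside a "text, image, text" sequence, can be illustrated with a minimal sketch. The abstract does not specify how spans are chosen or rendered, so everything here is an assumption made for illustration: spans are contiguous runs of whitespace-delimited words picked uniformly at random, and `ImageSegment`, `placeholder_render`, `local_modality_substitution`, and `flatten` are hypothetical names, not the authors' code. The renderer is a stand-in that only records which text would be rasterized.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class ImageSegment:
    """Stand-in for a rendered image; stores the text that would be rasterized."""
    source_text: str


Segment = Union[str, ImageSegment]


def placeholder_render(text: str) -> ImageSegment:
    # Hypothetical hook: a real pipeline would draw `text` onto an image here.
    return ImageSegment(source_text=text)


def local_modality_substitution(
    prompt: str,
    rng: random.Random,
    min_words: int = 1,
    max_words: int = 8,
    render: Callable[[str], ImageSegment] = placeholder_render,
) -> List[Segment]:
    """Replace one random contiguous word span of `prompt` with a rendered image.

    Returns up to three segments in reading order: leading text, the image
    segment, trailing text. Empty text segments are omitted.
    Assumption: random word-level spans; the paper's selection rule may differ.
    """
    words = prompt.split()
    if not words:
        return []

    # Clamp the span length to the prompt length so randint never sees an empty range.
    hi = min(max_words, len(words))
    lo = max(1, min(min_words, hi))
    span_len = rng.randint(lo, hi)
    start = rng.randint(0, len(words) - span_len)
    end = start + span_len

    segments: List[Segment] = []
    if start > 0:
        segments.append(" ".join(words[:start]))
    segments.append(render(" ".join(words[start:end])))
    if end < len(words):
        segments.append(" ".join(words[end:]))
    return segments


def flatten(segments: List[Segment]) -> str:
    """Recover the carried text, treating image segments as their source text."""
    return " ".join(
        s.source_text if isinstance(s, ImageSegment) else s for s in segments
    )


if __name__ == "__main__":
    demo = local_modality_substitution(
        "What is the capital of France and why is it famous?", random.Random(0)
    )
    for seg in demo:
        print(type(seg).__name__, "->", seg)
```

Because `flatten` recovers the whitespace-normalized prompt, the sketch makes the intended invariant explicit: the semantic content is unchanged and only its carrier differs. In a real data pipeline, `placeholder_render` would be replaced by an actual text rasterizer, and the supervision target (the answer) would stay the same as in the text-only sample.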