多模态视觉语言模型的人类中心区域适应
阅读原文· arxiv.org研究人员提出人类中心区域适应新范式,设计GG-EZ方法优化多模态视觉语言模型的区域文化适应性。该方法通过区域数据过滤与模型合并,在三类架构(大视觉语言模型、文生图扩散模型、视觉语言嵌入模型)上验证,以东南亚为案例实现文化相关性提升5-15%,同时保持98%以上全球泛化性能甚至偶尔超越原模型。研究确立了人类中心区域对齐作为多模态模型区域应用的基础范式。
While the field of vision-language (VL) has achieved remarkable success in integrating visual and textual information across multiple languages and domains, there is still no dedicated framework for assessing human-centric alignment in vision-language systems. We offer two contributions to address this gap. First, we introduce Anthropogenic Regional Adaptation: a novel paradigm that aims to optimize model relevance to specific regional contexts while ensuring the retention of global generalization capabilities. Second, we present a simple, but effective adaptation method named Geographical-generalization-made-easy (GG-EZ), which utilizes regional data filtering and model merging. Through comprehensive experiments on 3 VL architectures: large vision-language models, text-to-image diffusion models, and vision-language embedding models, and a case study in Southeast Asia (SEA) regional adaptation, we demonstrate the importance of Anthropogenic Regional Adaptation and the effectiveness of GG-EZ, showing 5-15% gains in cultural relevance metrics across SEA while maintaining over 98% of global performance and even occasionally surpassing it. Our findings establish Anthropogenic Regional Alignment as a foundational paradigm towards applicability of multimodal vision-language models in diverse regions and demonstrate a simple-yet-effective baseline method that optimizes regional value alignment while preserving global generalization.