VaaWIT: A Visual-Aware Large Language Model Adaptation Framework for Multilingual Web Image Translation
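Translating text in Web images is essential for improving content accessibility. Because of a visual representation gap, existing large vision-language models often overlook the fine-grained visual details needed to recognize diverse character morphologies, and therefore perform poorly on this task. To address this, this work proposes the VaaWIT framework, which uses a Dual-Stream Attention Module to enable bidirectional interaction between multilingual semantic features and visual details, and a Visual-Aware Adapter to inject the fused features into a frozen large language model backbone through parameter-efficient fine-tuning. Experiments show that the framework significantly outperforms SOTA open-source baselines across eight tasks on three public benchmarks, with performance comparable to closed-source models.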
Translating text embedded in Web images is crucial for improving content accessibility and cross-lingual information retrieval, particularly in social media and e-commerce domains. Although Large Vision-Language Models (LVLMs) have advanced multimodal understanding, applying them to Web image translation remains challenging because of the visual representation gap: standard encoders often prioritize high-level semantics over the fine-grained visual details required for recognizing diverse character morphologies. To address this challenge, we propose VaaWIT, an end-to-end framework that adapts Large Language Models (LLMs) for multilingual Web image translation. The framework makes two key technical contributions: (1) a Dual-Stream Attention Module (DSAM), which facilitates bidirectional interaction between multilingual semantic features and detailed visual representations, thereby synthesizing unified features that are robust to textual variations; and (2) a Visual-Aware Adapter (VAA), a parameter-efficient fine-tuning strategy that dynamically injects these fused visual cues into the frozen LLM backbone. This design enables the model to align visual context with linguistic reasoning effectively while limiting computational cost. Extensive experiments across eight tasks on three public benchmarks demonstrate that VaaWIT significantly outperforms state-of-the-art (SOTA) open-source baselines and achieves competitive performance against proprietary models. These results validate the efficacy of integrating fine-grained visual perception into LLMs for complex Web content analysis.
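To make the two components more concrete, the following is a minimal, dependency-free sketch of how bidirectional cross-attention and a gated adapter could fit together. It is an illustration under stated assumptions, not the VaaWIT implementation: the abstract does not specify the layer design, so the function names (`dual_stream_fuse`, `gated_adapter`), the omission of learned projections, the concatenation of the two updated streams, and the tanh gate are all hypothetical choices made here for readability.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Scaled dot-product attention: each query token attends over context tokens.

    Assumption: learned query/key/value projections are omitted for brevity,
    so the context tokens act as both keys and values.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, context)) for j in range(dim)]
        )
    return attended


def add(a, b, scale=1.0):
    """Element-wise a + scale * b for two token sequences of equal shape."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def dual_stream_fuse(semantic, visual):
    """Hypothetical stand-in for a dual-stream attention step.

    Each stream attends to the other (bidirectional interaction) with a
    residual connection; the updated streams are then concatenated along
    the token axis to form one fused sequence.
    """
    semantic_updated = add(semantic, cross_attention(semantic, visual))
    visual_updated = add(visual, cross_attention(visual, semantic))
    return semantic_updated + visual_updated


def gated_adapter(hidden, fused, gate):
    """Hypothetical gated injection of fused visual cues into LLM hidden states.

    The backbone's hidden states are left untouched except for an additive
    term scaled by tanh(gate); with gate = 0 the output equals the input,
    so only the adapter parameters would need training.
    """
    return add(hidden, cross_attention(hidden, fused), scale=math.tanh(gate))


# Toy inputs: 2 semantic tokens, 3 visual tokens, 1 hidden-state token (dim 2).
semantic = [[1.0, 0.0], [0.0, 1.0]]
visual = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
hidden = [[0.2, 0.1]]

fused = dual_stream_fuse(semantic, visual)
unchanged = gated_adapter(hidden, fused, gate=0.0)
injected = gated_adapter(hidden, fused, gate=1.0)
```

In this sketch, a zero gate leaves the hidden states identical to the frozen backbone's output, which is one common way to start parameter-efficient fine-tuning without disturbing pretrained behavior; whether VaaWIT uses such a gate is not stated in the abstract.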