TIPSv2:通过增强的 Patch-文本对齐推进视觉-语言预训练
阅读原文· arxiv.org研究团队发布 TIPSv2 图像-文本编码器模型家族,针对密集 Patch 表示与文本嵌入对齐难题提出多项改进。核心创新包括 iBOT++ 训练目标(让未掩码 token 直接参与损失计算)、Patch 级蒸馏技术(学生模型对齐能力竟超越教师模型)、优化指数移动平均机制及多粒度合成 Caption 采样策略。在涵盖 9 项任务和 20 个数据集的综合评测中,TIPSv2 性能与近期主流视觉编码器相当或更优。
Recent progress in vision-language pretraining has enabled significant improvements to many downstream computer vision applications, such as classification, retrieval, segmentation and depth prediction. However, a fundamental capability that these models still struggle with is aligning dense patch representations with text embeddings of corresponding concepts. In this work, we investigate this critical issue and propose novel techniques to enhance this capability in foundational vision-language models. First, we reveal that a patch-level distillation procedure significantly boosts dense patch-text alignment -- surprisingly, the patch-text alignment of the distilled student model strongly surpasses that of the teacher model. This observation inspires us to consider modifications to pretraining recipes, leading us to propose iBOT++, an upgrade to the commonly-used iBOT masked image objective, where unmasked tokens also contribute directly to the loss. This dramatically enhances patch-text alignment of pretrained models. Additionally, to improve vision-language pretraining efficiency and effectiveness, we modify the exponential moving average setup in the learning recipe, and introduce a caption sampling strategy to benefit from synthetic captions at different granularities. Combining these components, we develop TIPSv2, a new family of image-text encoder models suitable for a wide range of downstream applications. Through comprehensive experiments on 9 tasks and 20 datasets, we demonstrate strong performance, generally on par with or better than recent vision encoder models. Code and models are released via our project page at https://gdm-tipsv2.github.io/ .