# Enhancing Visual Instruction Tuning with Self-Supervised Guidance

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-04-14 08:00
- AIHOT link: https://aihot.virxact.com/items/cmo2z0qjp04p9slbao4sxapx5
- Original link: https://arxiv.org/abs/2604.12966

## AI Summary

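Multimodal large language models often underperform on vision-centric tasks because visual information is under-used during instruction tuning. The research team proposes a lightweight method that recasts classic self-supervised pretext tasks, such as rotation prediction and color matching, as image-instruction-response triplets, strengthening visual instruction tuning without human annotation or architectural changes. Experiments show that injecting just 3-10% of such visually grounded instructions into the training data consistently improves fine-grained visual reasoning across multiple models and benchmarks.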

## Main Text

Multimodal large language models (MLLMs) perform well on many vision-language tasks but often struggle with vision-centric problems that require fine-grained visual reasoning. Recent evidence suggests that this limitation arises not from weak visual representations, but from under-utilization of visual information during instruction tuning, where many tasks can be partially solved using language priors alone. We propose a simple and lightweight approach that augments visual instruction tuning with a small number of visually grounded self-supervised tasks expressed as natural language instructions. By reformulating classical self-supervised pretext tasks, such as rotation prediction, color matching, and cross-view correspondence, as image-instruction-response triplets, we introduce supervision that cannot be solved without relying on visual evidence. Our approach requires no human annotations, no architectural modifications, and no additional training stages. Across multiple models, training regimes, and benchmarks, injecting only a small fraction (3-10%) of such visually grounded instructions consistently improves performance on vision-centric evaluations. Our findings highlight instruction tuning with visually grounded SSL tasks as a powerful lever for improving visual reasoning in MLLMs through simple adjustments to the training data distribution. Code available at: https://github.com/sirkosophia/V-GIFT
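To make the idea concrete, here is a minimal sketch of how a rotation-prediction pretext task might be recast as an image-instruction-response triplet and mixed into an instruction-tuning set. It is not the V-GIFT implementation: the names `Triplet`, `make_rotation_triplet`, and `mix_in_ssl`, the prompt wording, and the toy nested-list image format are all hypothetical. The sketch also assumes the 3-10% figure refers to the share of the final mixture, which the abstract does not specify.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence

Image = List[List[int]]  # toy grayscale image stored as rows of pixel values
ANGLES = (0, 90, 180, 270)


@dataclass
class Triplet:
    image: Image
    instruction: str
    response: str


def rotate(img: Image, angle: int) -> Image:
    """Rotate a row-major image counter-clockwise by a multiple of 90 degrees."""
    if angle not in ANGLES:
        raise ValueError(f"angle must be one of {ANGLES}")
    out = [list(row) for row in img]
    for _ in range(angle // 90):
        # Transpose, then reverse the row order: one 90-degree CCW turn.
        out = [list(row) for row in zip(*out)][::-1]
    return out


def make_rotation_triplet(img: Image, rng: random.Random) -> Triplet:
    """Build one self-supervised sample; the label comes from the transform itself."""
    angle = rng.choice(ANGLES)
    return Triplet(
        image=rotate(img, angle),
        # Hypothetical prompt wording, chosen only for illustration.
        instruction=(
            "This image was rotated counter-clockwise by 0, 90, 180, or 270 "
            "degrees. By how many degrees was it rotated?"
        ),
        response=f"{angle} degrees",
    )


def mix_in_ssl(
    base: Sequence[Triplet],
    ssl_pool: Sequence[Triplet],
    fraction: float,
    rng: random.Random,
) -> List[Triplet]:
    """Return a shuffled mix where SSL samples make up about `fraction` of the total.

    Assumption: `fraction` is measured against the final mixture size.
    """
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    n_ssl = round(len(base) * fraction / (1.0 - fraction))
    n_ssl = min(n_ssl, len(ssl_pool))
    mixed = list(base) + rng.sample(list(ssl_pool), n_ssl)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    rng = random.Random(0)
    toy_image = [[1, 2], [3, 4]]
    base = [Triplet(toy_image, "Describe the image.", "A 2x2 grid.") for _ in range(95)]
    pool = [make_rotation_triplet(toy_image, rng) for _ in range(20)]
    mixed = mix_in_ssl(base, pool, fraction=0.05, rng=rng)
    print(len(mixed))
```

Because the answer is derived from the transformation applied to the image, no human annotation is needed, and the question cannot be answered from language priors alone. Color matching or cross-view correspondence could be added as further triplet generators feeding the same pool.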