ViGOS：视觉引导的在线自蒸馏框架

2026-06-17 08:00·16天前

AI 摘要

针对在线自蒸馏（OPSD）直接扩展到多模态大语言模型（MLLM）时产生的捷径（特权目标依赖文本参考而非图像），ViGOS提出视觉引导的OPSD框架：学生先写出视觉描述再推理。有效rollout中，纯图像感知教师监督描述，特权推理教师监督推理和答案；无效rollout由参考教师恢复输出格式。ViGOS在通用视觉语言、专家推理等基准上保持OPSD优势，并改善了图像依赖行为。

原文 · 未翻译

On-policy self-distillation (OPSD) trains a model on its own rollouts and uses a frozen copy to provide dense token-level targets conditioned on a reference target. This works well for LLM reasoning, but a direct extension to multimodal large language models (MLLMs) can create a shortcut: the privileged target may guide tokens mainly based on the text reference target rather than the image. We propose ViGOS, a visually grounded OPSD framework for MLLM post-training. The student first writes a visual description and then reasons toward the final answer. For valid rollouts, an image-only perception teacher supervises the description, while a privileged reasoning teacher supervises the reasoning and final answer on the same student prefix. A reference teacher is used only for invalid rollouts to recover the output format. Across general vision-language, expert reasoning, visual math, spatial grounding, and visual-language-prior benchmarks, ViGOS keeps the main benefits of OPSD and improves image-grounded behavior in shortcut-prone settings.

HuggingFace Daily Papers（社区热门论文）

50导出 Markdown

ViGOS：视觉引导的在线自蒸馏框架

2026-06-17 08:00·16天前

阅读原文· arxiv.org

AI 摘要

原文 · 保持原样，未翻译