V-Zero: Answer-Label-Free Contrastive Evidence Gating for Fine-Grained Visual Reasoning

2026-06-24 08:00 · 9 days ago

AI Summary

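Multimodal large language models (MLLMs) performing fine-grained visual reasoning have traditionally relied on reinforcement learning or large-scale annotated reasoning traces, both of which are costly. V-Zero proposes a framework that needs no annotated textual answer labels: it pairs a question-relevant regional crop with a negative visual view to evaluate trajectories sampled by the student model and to gate fine-grained token-level knowledge distillation, thereby introducing trajectory-level discrimination. On multiple visual reasoning benchmarks, V-Zero consistently improves fine-grained visual reasoning while preserving strong generalization, and it trains more than 5× faster than supervised fine-tuning methods and more than 10× faster than reinforcement learning baselines. The code and dataset will be open-sourced.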

Original text · untranslated

Fine-grained visual reasoning requires multimodal large language models (MLLMs) to identify task-relevant visual evidence and ground their reasoning in local image regions. Existing agentic methods typically rely on reinforcement learning with verifiable rewards or supervised fine-tuning on large-scale annotated reasoning traces, leading to costly exploration, hand-designed verification rules, or heavy dependence on textual supervision. A natural way to avoid such external answer labels is to learn from trajectories sampled by the student itself, which points to On-Policy Distillation (OPD). To understand what OPD can and cannot provide for visual reasoning, we revisit it as negative-free stop-gradient alignment. This perspective shows that, although OPD provides effective token-level correction, its ceiling is constrained by the absence of trajectory-level discrimination. Motivated by these observations, we propose V-Zero, an answer-label-free framework for visual reasoning with contrastive evidence gating. V-Zero uses no annotated textual answer labels; instead, during training it pairs a question-relevant regional crop with a negative visual view to evaluate student-sampled trajectories and gate dense token-level distillation. Experiments on multiple visual reasoning benchmarks show that V-Zero consistently improves fine-grained visual reasoning while preserving strong generalization. Notably, V-Zero is more than 5× faster than previous supervised fine-tuning methods and more than 10× faster than reinforcement learning baselines. Code and dataset will be released at https://github.com/eVI-group-SCU/V-Zero
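To make the idea of "gating dense token-level distillation" more concrete, here is a minimal illustrative sketch in plain Python. It is not the authors' implementation: the abstract does not specify how the gate is computed, which divergence is used, or how the crop and negative view are built. The sketch assumes (hypothetically) that a teacher scores the student-sampled trajectory once with the question-relevant crop and once with the negative view, that the gate opens only when the crop supports the trajectory better by some `margin`, and that the distillation term is a per-token KL divergence from student to teacher. All function names and the `margin` parameter are invented for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def sequence_logprob(token_logits, token_ids):
    """Total log-probability a model assigns to a sampled trajectory."""
    return sum(log_softmax(l)[t] for l, t in zip(token_logits, token_ids))

def evidence_gate(pos_logits, neg_logits, token_ids, margin=0.0):
    """Hypothetical trajectory-level gate (an assumption, not from the paper):
    1.0 if the trajectory is better supported under the positive crop
    than under the negative view by more than `margin`, else 0.0."""
    gap = (sequence_logprob(pos_logits, token_ids)
           - sequence_logprob(neg_logits, token_ids))
    return 1.0 if gap > margin else 0.0

def token_kl(student_logits, teacher_logits):
    """Per-token KL(student || teacher); the teacher is treated as a constant."""
    out = []
    for s, t in zip(student_logits, teacher_logits):
        ls, lt = log_softmax(s), log_softmax(t)
        out.append(sum(math.exp(a) * (a - b) for a, b in zip(ls, lt)))
    return out

def gated_distillation_loss(student_logits, pos_logits, neg_logits,
                            token_ids, margin=0.0):
    """Mean token-level KL toward the crop-conditioned teacher,
    multiplied by the trajectory-level gate."""
    gate = evidence_gate(pos_logits, neg_logits, token_ids, margin)
    kl = token_kl(student_logits, pos_logits)
    return gate * sum(kl) / len(kl)

# Toy example: 2 tokens, vocabulary of size 3.
ids = [0, 1]
pos = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]   # teacher sees the relevant crop
neg = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # teacher sees the negative view
student = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
loss = gated_distillation_loss(student, pos, neg, ids)
```

In this toy setup the gate opens because the crop-conditioned teacher assigns the trajectory a higher likelihood than the negative-view teacher, so the token-level term contributes; a trajectory that the negative view explains equally well or better would contribute zero loss under these assumptions.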

HuggingFace Daily Papers (community trending papers)
