# V-Zero: Answer-Label-Free Contrastive Evidence Gating for Fine-Grained Visual Reasoning

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-24 08:00
- AIHOT score: 44
- AIHOT link: https://aihot.virxact.com/items/cmqt1qcye08a2slfug8ofvux1
- Original link: https://arxiv.org/abs/2606.25319

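## AI Summary

When multimodal large language models (MLLMs) perform fine-grained visual reasoning, conventional methods rely on reinforcement learning or large-scale annotated reasoning traces, both of which are costly. V-Zero proposes a framework that needs no annotated textual answer labels: it pairs a question-relevant regional crop with a negative visual view to evaluate trajectories sampled by the student model and to gate fine-grained token-level knowledge distillation, which introduces trajectory-level discrimination. On multiple visual reasoning benchmarks, V-Zero consistently improves fine-grained visual reasoning while maintaining strong generalization, and it trains more than 5× faster than supervised fine-tuning methods and more than 10× faster than reinforcement learning baselines. Code and datasets will be open-sourced.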

## Main Text

Fine-grained visual reasoning requires multimodal large language models (MLLMs) to identify task-relevant visual evidence and ground their reasoning in local image regions. Existing agentic methods typically rely on reinforcement learning with verifiable rewards or supervised fine-tuning on large-scale annotated reasoning traces, leading to costly exploration, hand-designed verification rules, or heavy dependence on textual supervision. A natural way to avoid such external answer labels is to learn from trajectories sampled by the student itself, which points to On-Policy Distillation (OPD). To understand what OPD can and cannot provide for visual reasoning, we revisit it as negative-free stop-gradient alignment. This perspective shows that, although OPD provides effective token-level correction, its ceiling is constrained by the absence of trajectory-level discrimination. Motivated by these observations, we propose V-Zero, an answer-label-free framework for visual reasoning with contrastive evidence gating. V-Zero uses no annotated textual answer labels; instead, during training it pairs a question-relevant regional crop with a negative visual view to evaluate student-sampled trajectories and gate dense token-level distillation. Experiments on multiple visual reasoning benchmarks show that V-Zero consistently improves fine-grained visual reasoning while preserving strong generalization. Notably, V-Zero is more than 5× faster than previous supervised fine-tuning methods and more than 10× faster than reinforcement learning baselines. Code and dataset will be released at https://github.com/eVI-group-SCU/V-Zero
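
The abstract does not spell out how the gate is computed, so the following is only a minimal sketch of one way contrastive evidence gating could be wired to token-level distillation. It is not the authors' implementation. Every name here (`evidence_gate`, `gated_distillation_loss`, `margin`) is hypothetical. The scoring rule is also an assumption: a student-sampled trajectory passes the gate when a scorer conditioned on the question-relevant crop assigns it a higher log-likelihood than the same scorer conditioned on the negative view. The per-token loss is assumed to be a reverse KL from student to the crop-conditioned distribution.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def sequence_logprob(token_logits, token_ids):
    """Sum of log-probabilities assigned to the sampled tokens."""
    return sum(log_softmax(l)[t] for l, t in zip(token_logits, token_ids))

def evidence_gate(pos_logits, neg_logits, token_ids, margin=0.0):
    """Hypothetical trajectory-level gate (assumption, not from the paper).

    Returns 1.0 when the trajectory is more likely under the positive crop
    than under the negative view by more than `margin`, else 0.0.
    """
    gap = (sequence_logprob(pos_logits, token_ids)
           - sequence_logprob(neg_logits, token_ids))
    return 1.0 if gap > margin else 0.0

def token_kl(student_logits, teacher_logits):
    """Reverse KL(student || teacher) at a single token position."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(si) * (si - ti) for si, ti in zip(s, t))

def gated_distillation_loss(student_logits, pos_logits, neg_logits,
                            token_ids, margin=0.0):
    """Mean per-token KL, multiplied by the trajectory-level gate."""
    gate = evidence_gate(pos_logits, neg_logits, token_ids, margin)
    per_token = [token_kl(s, t) for s, t in zip(student_logits, pos_logits)]
    return gate * sum(per_token) / len(per_token)

# Toy example: vocabulary of 3 tokens, trajectory of 2 tokens.
tokens = [0, 1]
pos = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]   # crop view supports the trajectory
neg = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # negative view is uninformative
student = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

loss_kept = gated_distillation_loss(student, pos, neg, tokens)
loss_dropped = gated_distillation_loss(student, neg, pos, tokens)
```

In this toy setup the first trajectory is supported by the crop view, so its token-level loss is kept; with the two views swapped, the gate closes and the trajectory contributes nothing. A hard 0/1 gate is the simplest choice for illustration; a soft weight derived from the log-likelihood gap would be an equally plausible design under the same assumptions.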
