# VLM作为视频推理教师：通过自适应测试时优化实现

- 来源：HuggingFace Daily Papers（社区热门论文）
- 发布时间：2026-06-01 08:00
- AIHOT 分数：69
- AIHOT 链接：https://aihot.virxact.com/items/cmpw3ayof03zaslukz2znw7rg
- 原文链接：https://arxiv.org/abs/2606.02564

## AI 摘要

本研究提出一种新范式，将视觉语言模型的角色从问题“求解者”转变为指导视频生成模型的“教师”。现有VLM作为求解器效果不佳，但其感知能力强，可评估任务规则满足度。新方法利用VLM提取任务规则，构建可微分奖励，并通过测试时在线优化轻量级LoRA模块，引导视频生成模型推理。在VBVR-Bench和RULER-Bench两个视频推理基准上，该方法平均性能提升16.7分，显著优于其他基线方法。

## 正文

The recent "Reasoning with Video" paradigm utilizes Video Generation Models (VGMs) to generate temporally coherent visual trajectories to complete reasoning tasks. Although state-of-the-art VGMs excel at visual quality, they often struggle to understand and follow task-specific rules, leading to logical failures across diverse reasoning scenarios. Existing efforts try to utilize Vision-Language Models (VLMs) as problem pre-solvers to produce or refine textual guidance for the VGM. However, textual descriptions fail to capture intricate spatiotemporal details, and VGMs often struggle to faithfully execute fine-grained or long-tail instructions even with a valid plan. While VLMs struggle as solvers, they possess strong perception capabilities to evaluate process-constraint satisfaction and final-goal achievement. Leveraging this strength, we introduce a paradigm shift that transitions the role of VLMs to "teachers". Specifically, a VLM teacher extracts task-specific rules to formulate differentiable rewards, guiding a VGM Reasoner via test-time online optimization of a lightweight LoRA module. This strategy enables adaptive test-time optimization and extends the reasoning capabilities beyond the VGM's intrinsic boundaries. Evaluations on symbolic (VBVR-Bench) and general-purpose (RULER-Bench) video reasoning benchmarks show that the proposed method yields a 16.7-point average performance gain, outperforming the VLM-as-Solver paradigm (+0.4 points) and Best-of-N scaling (+2.2 points) by a large margin at comparable test-time cost. These findings reveal that integrating VLMs as test-time teachers offers a promising paradigm for achieving generalizable video reasoning. Project Page: https://VLM-as-Teacher.github.io/
