面向鲁棒视频理解的置信感知工具编排
阅读原文· arxiv.org视频推理模型假设每帧可靠,在运动模糊、眩光等扰动下准确率下降15–30%p。Robust-TO框架将每帧信任度融入推理各阶段:通过统一接口组织异构视觉工具,每个工具接收子查询和经可靠性-相关性评分筛选的可信帧,返回预测、时间定位和校准可靠性分数。推理时分数指导三层次综合(高/中/低)与置信-成本GRPO奖励,联合优化正确性、可靠性和效率。在八个任务上,Robust-TO清洗输入准确率56.4%,超过最强开源基线10.6%p和Gemini-2.5-Pro(46.2%);五种腐蚀下保持54.3%,高出最强开源基线5.8%p,且准确率下降最小。
Video reasoning language models implicitly assume that every input frame is equally reliable. This leads to what we term the Blind Trust Problem: under realistic perturbations such as motion blur, glare, or occlusion, frontier video reasoning models can suffer 15-30%p accuracy drops on real-world embodied benchmarks, while remaining unaware that their visual evidence has been degraded. To address this challenge, we propose Robust-TO, an agentic video understanding framework that explicitly integrates per-frame trustworthiness into every stage of reasoning. Robust-TO organizes heterogeneous visual perception tools under a unified evidence interface. Each tool receives a sub-query derived from the original question and a set of trustworthy frames selected by the reliability-relevance score. It returns evidence in a shared format: a concrete prediction (e.g., a bounding box, motion trajectory, recognized text, or action label), temporal grounding, and a calibrated reliability score. During reasoning, these calibrated scores guide evidence weighting in a three-tier synthesis process (high/medium/low) and define a confidence-cost GRPO reward that jointly optimizes correctness, evidence reliability, and efficiency. On two video reasoning benchmarks spanning eight tasks, Robust-TO achieves 56.4% average accuracy on clean inputs, surpassing the strongest open-source baseline by 10.6%p and outperforming Gemini-2.5-Pro (46.2%). Under five realistic corruption types, Robust-TO maintains 54.3% average accuracy, 5.8%p above the strongest open-source baseline, while exhibiting the smallest clean-to-corrupted accuracy drop among all compared methods.