# VSTAT, a Visual State Tracking Benchmark: Evaluating the Video Understanding of Multimodal Large Language Models

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-02 08:00
- AIHOT score: 68
- AIHOT link: https://aihot.virxact.com/items/cmpxgncg903s4slck6rkn6zjo
- Original link: https://arxiv.org/abs/2606.03920

## AI Summary

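Researchers propose the Visual State Tracking benchmark (VSTAT) to diagnose how well multimodal large language models (MLLMs) keep track of entities and states while understanding video. The benchmark contains 834 clips drawn from synthetic and real-world videos, paired with 1,500 questions that can only be answered through continuous perception. In testing, current state-of-the-art MLLMs perform far below humans on VSTAT and only modestly above answer-prior baselines. The analysis indicates that the models reason and track correctly in text but fail to visually perceive the events they need to track. A preliminary evaluation also suggests that existing approaches, including agentic ones, do not readily resolve the problem.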

## Main Text

Understanding a video requires more than recognizing isolated moments, as humans continuously track entities, states, and events over time. This capacity for visual state tracking is fundamental to video understanding, yet remains underexplored in current evaluations of Multimodal Large Language Models (MLLMs). We introduce the Visual STAte Tracking benchmark (VSTAT), a video-based benchmark designed to diagnose visual state tracking in MLLMs. VSTAT consists of 834 clips drawn from both synthetic and real-world videos, paired with 1,500 questions that cannot be answered from any single frame or short segment and instead require continuous perception and integration of events across the entire video stream. Although state-of-the-art MLLMs perform strongly on existing video benchmarks, we find that on VSTAT they perform far below humans and only modestly above answer-prior baselines. To analyze this gap, we compare MLLMs' thinking traces with the underlying video stream to understand why and when MLLMs fail on VSTAT. We find that MLLMs reason and track correctly in text, but fail at visually perceiving the events they need to track. Finally, our preliminary evaluation suggests that recent agentic approaches, including MLLM-based video agents and coding agents, do not readily resolve these failures and still fall short on VSTAT.
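
The abstract compares models against answer-prior baselines but does not say how those baselines are computed. As a rough illustration only, the sketch below shows one common reading of the term: a predictor that never looks at the video and always returns the most frequent answer from a reference set. The function name `answer_prior_accuracy` and the toy answer lists are hypothetical and are not taken from the paper or from VSTAT data.

```python
from collections import Counter


def answer_prior_accuracy(reference_answers, test_answers):
    """Accuracy of always predicting the most frequent reference answer.

    Assumption: this is one plausible form of an "answer-prior" baseline,
    not necessarily the one used in the paper.
    """
    if not reference_answers or not test_answers:
        raise ValueError("both answer lists must be non-empty")
    # The prior is simply the answer that occurs most often in the reference set.
    majority_answer, _ = Counter(reference_answers).most_common(1)[0]
    # Score the constant prediction against the held-out answers.
    correct = sum(1 for answer in test_answers if answer == majority_answer)
    return correct / len(test_answers)


# Toy, made-up answers used only to exercise the function.
toy_reference = ["A", "B", "B", "C"]
toy_test = ["B", "B", "A", "D"]
toy_accuracy = answer_prior_accuracy(toy_reference, toy_test)
```

A model that scores only modestly above such a baseline is gaining little from the video itself, which is why a comparison of this kind is informative for a benchmark meant to require continuous perception.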