# Human-View Video Understanding with MLLMs: Watching, Remembering, Reasoning

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-05 08:00
- AIHOT score: 63
- AIHOT link: https://aihot.virxact.com/items/cmq4oaypg006yslt2yvr1ujcm
- Original link: https://arxiv.org/abs/2606.07433

## AI Summary

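This survey examines video understanding based on multimodal large language models (MLLMs) from a human-view perspective, organizing it around three core abilities: watching, remembering, and reasoning. The paper proposes a unified framework that characterizes systems by their perceptual representations, memory states, reasoning traces, and final predictions, and identifies key challenges in spatio-temporal perception, efficient long-video processing, memory modeling, streaming understanding, and faithful reasoning. It groups methods into fine-grained, comprehensive, audio-visual, and efficient perception (watching), offline and streaming memory (remembering), and text-only reasoning and thinking with videos (reasoning). It also covers applications in egocentric, sports, instructional, medical, and narrative videos, compiles training datasets and evaluation benchmarks, and closes with open problems in scalable, memory-aware, and evidence-grounded video intelligence.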

## Main Text

Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research moves from short clips to long, multimodal, and knowledge-intensive video scenarios. These scenarios require models to handle sparse evidence, long-range dependencies, multimodal alignment, and reliable inference under limited computational budgets. This work presents a human-view perspective on MLLM-based video understanding, organized around three functional abilities: watching, remembering, and reasoning. Rather than treating video tasks as isolated benchmarks, this view provides a unified structure for analyzing how video MLLMs acquire evidence, preserve context, and produce grounded outputs. We introduce a formulation that characterizes video understanding systems by their perceptual representations, memory states, reasoning traces, and final predictions. Based on this formulation, we identify challenges in spatio-temporal perception, efficient long-video processing, memory modeling, streaming understanding, and faithful reasoning. Representative methods are organized by their roles in video MLLM systems. Watching covers fine-grained, comprehensive, audio-visual, and efficient perception. Remembering includes offline and streaming memory, while reasoning covers text-only reasoning and thinking with videos. We further examine application domains such as egocentric, sports, instructional, medical, and narrative videos, and cover training datasets and evaluation benchmarks across task types, supervision formats, modalities, and capability dimensions. Finally, we outline open problems and future directions for scalable, memory-aware, and evidence-grounded video intelligence. Related works will be continuously traced at https://github.com/marinero4972/Awesome-HumanView-VideoUnderstanding.
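
The abstract describes a system in terms of four components (perceptual representations, memory states, reasoning traces, and final predictions) but gives no concrete formulation. The toy sketch below only illustrates how those four pieces might fit together in a watch, remember, reason pipeline. Every name in it (`VideoState`, `watch`, `remember`, `reason`, `run`, and the `stride` and `capacity` parameters) is a hypothetical stand-in, not the paper's formulation or any library's API, and the string "features" stand in for real model outputs.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class VideoState:
    """Hypothetical container for the four components named in the abstract."""
    perceptual_representations: List[str] = field(default_factory=list)
    memory_state: List[str] = field(default_factory=list)
    reasoning_trace: List[str] = field(default_factory=list)
    final_prediction: str = ""


def watch(frames: List[str], stride: int = 2) -> List[str]:
    # Toy "watching": uniformly subsample frames, a stand-in for efficient perception.
    return [f"feat({frame})" for frame in frames[::stride]]


def remember(memory: List[str], features: List[str], capacity: int = 3) -> List[str]:
    # Toy "remembering": a bounded memory that keeps only the most recent features.
    return (memory + features)[-capacity:]


def reason(memory: List[str], query: str) -> Tuple[List[str], str]:
    # Toy "reasoning": cite the memory entries that mention the query, then answer.
    trace = [f"evidence: {item}" for item in memory if query in item]
    answer = "yes" if trace else "unknown"
    return trace, answer


def run(frames: List[str], query: str) -> VideoState:
    # Chain the three abilities and record every intermediate component.
    state = VideoState()
    state.perceptual_representations = watch(frames)
    state.memory_state = remember(state.memory_state, state.perceptual_representations)
    state.reasoning_trace, state.final_prediction = reason(state.memory_state, query)
    return state


if __name__ == "__main__":
    result = run(["cat", "dog", "cat", "bird", "dog"], "dog")
    print(result.reasoning_trace, result.final_prediction)
```

Keeping the reasoning trace next to the prediction mirrors the abstract's emphasis on grounded outputs: the answer can be checked against the evidence that was actually retained in memory.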