# InternVideo3: Multimodal Contextual Reasoning Enhances the Long-Horizon Agentic Capabilities of Foundation Models

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-10 08:00
- AIHOT score: 59
- AIHOT link: https://aihot.virxact.com/items/cmq8ws2lj06ijslldhfohb468
- Original link: https://arxiv.org/abs/2606.12195

## AI Summary

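The InternVideo3 framework uses Multimodal Contextual Reasoning (MCR) to improve foundation models on long-horizon multimodal tasks. MCR treats understanding as a closed-loop process comprising observations, instructions, reasoning, tool actions, and memory, and frames long-video understanding as evidence accumulation and verification. For efficiency, the framework introduces Multimodal Multi-head Latent Attention (M²LA), a token-preserving reparameterization that compresses KV-cache states while retaining the full token stream. Training proceeds in stages: continued pretraining, short-to-long supervised fine-tuning, rule-based reinforcement learning, and on-policy distillation. InternVideo3 achieves strong performance on benchmarks such as Video-MME, MLVU, and EgoSchema, and is also instantiated as a video agent with retrieval tools that exhibits robust, evidence-grounded behavior.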

## Main Text

Recent progress in foundation models has shifted toward agentic behavior involving multi-step reasoning and tool use. However, open-source efforts largely focus on text-dominant settings, leaving long-horizon multimodal tasks underexplored. This gap is evident in video tasks requiring sustained temporal understanding and iterative interaction. We present InternVideo3, a framework enhancing these capabilities via Multimodal Contextual Reasoning (MCR). MCR treats understanding as a closed-loop process over a shared, evolving context containing observations, instructions, reasoning, tool actions, and memory. This frames long-video understanding as evidence accumulation and verification. To ensure efficiency, we introduce Multimodal Multi-head Latent Attention (M^2LA), a token-preserving reparameterization compressing KV-cache states while retaining the full token stream. Our staged training includes continued pretraining, short-to-long supervised fine-tuning, rule-based reinforcement learning, and on-policy distillation. Experiments show InternVideo3 achieves strong performance on benchmarks like Video-MME, MLVU, and EgoSchema. We further instantiate the model as a video agent with retrieval tools, demonstrating robust evidence-grounded behavior. Our results suggest that efficient context handling and closed-loop reasoning are vital for adapting open multimodal models toward long-horizon visually grounded agency.
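The closed-loop framing above can be made concrete with a toy sketch. The abstract does not give implementation details, so this is not the InternVideo3 method. Every name here (`Context`, `retrieve_clips`, `run_loop`, the keyword-matching retrieval, and the "stop after `needed` pieces of evidence" rule) is a hypothetical stand-in. It only shows a shared context that grows with tool actions, observations, reasoning, and memory until enough evidence has accumulated.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Context:
    """Shared, evolving context holding the five element types named in the abstract."""
    instruction: str
    observations: List[str] = field(default_factory=list)
    reasoning: List[str] = field(default_factory=list)
    tool_actions: List[str] = field(default_factory=list)
    memory: List[str] = field(default_factory=list)


def retrieve_clips(video_index: Dict[float, str], query: str) -> List[Tuple[float, str]]:
    """Hypothetical retrieval tool: keyword match over per-timestamp captions."""
    words = query.lower().split()
    return [
        (t, caption)
        for t, caption in sorted(video_index.items())
        if any(w in caption.lower() for w in words)
    ]


def run_loop(ctx: Context, video_index: Dict[float, str],
             queries: List[str], needed: int = 2) -> bool:
    """Act, observe, reason, and stop once `needed` observations are collected.

    The stopping rule is an assumption made for illustration only.
    """
    for query in queries:
        # Tool action: record the call in the shared context, then execute it.
        ctx.tool_actions.append(f"retrieve_clips({query!r})")
        for t, caption in retrieve_clips(video_index, query):
            obs = f"[{t:.0f}s] {caption}"
            if obs not in ctx.observations:  # accumulate each piece of evidence once
                ctx.observations.append(obs)
        # Reasoning step: a placeholder note about the current evidence state.
        ctx.reasoning.append(
            f"{len(ctx.observations)} observation(s) after query {query!r}"
        )
        # Verification step: stop when the (assumed) evidence threshold is met.
        if len(ctx.observations) >= needed:
            ctx.memory.append("enough evidence collected")
            return True
    return False


if __name__ == "__main__":
    index = {
        12.0: "a person opens the fridge",
        95.0: "the person pours milk",
        300.0: "a dog runs outside",
    }
    ctx = Context(instruction="What did the person drink?")
    print(run_loop(ctx, index, ["fridge", "milk"]))
    print(ctx.observations)
```

In a real system the queries would be generated by the model from the evolving context rather than supplied up front. The fixed list keeps the sketch deterministic.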
