# MBench: A Comprehensive Benchmark for the Memory Capability of Video World Models

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-08 08:00
- AIHOT score: 48
- AIHOT link: https://aihot.virxact.com/items/cmqeoxx5004ayslunosaa8kr2
- Original paper: https://arxiv.org/abs/2606.00793

## AI Summary

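Existing benchmarks for video world models focus mainly on visual quality, motion coherence, and text-video alignment, and overlook long-term memory, a core capability of a world model. MBench systematically decomposes memory capability into three hierarchical dimensions (entity consistency, environment consistency, and causal consistency), refined into 12 quantifiable sub-dimensions. The benchmark is built on carefully curated real long videos and combines rule-based quantitative matrices with a vision-language model for objective evaluation. Evaluations of several mainstream video world models reveal systemic limitations of existing methods in long-term state retention, giving the field a standardized benchmark and a clear research direction.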

## Main Text

Recent advancements in video-based world models have demonstrated an unprecedented ability to synthesize high-fidelity visual sequences. However, a fundamental gap persists between visually plausible video generation and the functional requirements of a world model, particularly in maintaining a stable and reasonable internal state over extended temporal horizons. While existing benchmarks primarily emphasize visual quality, motion coherence, and text-video alignment, they largely overlook memory, the core capability of a world model to preserve consistency across long-term horizons and complex interactions. To address this gap, we present MBench, a comprehensive benchmark dedicated to quantifying and evaluating the memory capability of video world models. We systematically decompose the memory capability of video world models into three hierarchical and complementary core dimensions: entity consistency, environment consistency, and causal consistency, which are further refined into 12 quantifiable sub-dimensions for comprehensive characterization of long-term memory. Our benchmark is built upon rigorously curated real-captured long videos, and evaluated by rule-based quantitative matrices and VLM to enable objective and comprehensive consistency assessment. Extensive evaluations of mainstream state-of-the-art video world models reveal critical systemic limitations of existing methods in long-term state retention, providing a standardized benchmark and clear research direction to advance the field.