# YoCausal: How Far Is Video Generation from World Models? A Causality Perspective

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-28 08:00
- AIHOT score: 65
- AIHOT link: https://aihot.virxact.com/items/cmpqfb3d604ibslnobddrpsql
- Original link: https://arxiv.org/abs/2605.30346

## AI Summary

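This paper proposes YoCausal, a two-level benchmark inspired by the "violation of expectation" paradigm from cognitive science, for evaluating the causal understanding of video diffusion models (VDMs). Level 1 constructs counterfactual samples by temporally reversing real videos at zero cost and introduces the Reverse Surprise Index (RSI) to quantify a model's perception of the arrow of time. Level 2 introduces the Causality Cognition Index (CCI), which uses a vision-language model to stratify the dataset, separating genuine causal reasoning from temporal bias. An evaluation of 13 state-of-the-art VDMs shows that perceiving the arrow of time is not equivalent to understanding causality, and that current models still fall well short of human-level causal cognition.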

## Main Text

As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or merely overfit to statistical temporal patterns? Existing benchmarks mostly rely on synthetic data, limiting real-world generalization due to the sim-to-real gap. We present YoCausal, a two-level benchmark inspired by the Violation of Expectation (VoE) paradigm from cognitive science. By temporally reversing real-world videos at zero cost as natural counterfactual samples, YoCausal establishes an arbitrarily extensible evaluation protocol. Level 1 introduces the Reverse Surprise Index (RSI), quantifying arrow-of-time perception via denoising loss. Level 2 introduces the Causality Cognition Index (CCI), which leverages a VLM to stratify datasets into causal and non-causal subsets, disentangling genuine causal reasoning from temporal bias. Evaluation of 13 state-of-the-art VDMs reveals that perceiving the arrow of time does not imply understanding causality, and a significant gap persists relative to human-level causal cognition.
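The Level 1 idea (compare a model's denoising loss on a real clip against the same clip played backwards) can be illustrated with a small sketch. This is not the paper's RSI formula, which the abstract does not give. The normalized gap `(rev - fwd) / (rev + fwd)`, the function names, the noise schedule, and both toy denoisers below are assumptions made purely for illustration.

```python
import numpy as np

def reverse_clip(clip):
    """Flip a clip of shape (T, H, W) along the time axis."""
    return clip[::-1].copy()

def denoising_loss(denoiser, clip, noise_level, rng):
    """MSE between the injected noise and the denoiser's noise prediction."""
    noise = rng.standard_normal(clip.shape)
    noisy = np.sqrt(1.0 - noise_level) * clip + np.sqrt(noise_level) * noise
    return float(np.mean((denoiser(noisy, noise_level) - noise) ** 2))

def surprise_gap(denoiser, clip, noise_level=0.5, n_samples=8, seed=0):
    """Hypothetical score (NOT the paper's RSI): relative extra loss on the
    time-reversed clip. Positive means the reversed clip is more 'surprising'."""
    rng = np.random.default_rng(seed)
    fwd = np.mean([denoising_loss(denoiser, clip, noise_level, rng)
                   for _ in range(n_samples)])
    rng = np.random.default_rng(seed)  # reuse the same noise draws for fairness
    rev = np.mean([denoising_loss(denoiser, reverse_clip(clip), noise_level, rng)
                   for _ in range(n_samples)])
    return float((rev - fwd) / (rev + fwd + 1e-12))

def make_ramp_denoiser(prior):
    """Toy stand-in for a VDM: assumes the clean clip equals `prior`."""
    def denoiser(noisy, noise_level):
        return (noisy - np.sqrt(1.0 - noise_level) * prior) / np.sqrt(noise_level)
    return denoiser

def blind_denoiser(noisy, noise_level):
    """Toy model with no temporal prior: always predicts zero noise."""
    return np.zeros_like(noisy)

# Toy "video": brightness ramps up over 6 frames of 4x4 pixels.
T = 6
clip = np.linspace(0.0, 1.0, T)[:, None, None] * np.ones((T, 4, 4))

gap_ramp = surprise_gap(make_ramp_denoiser(clip), clip)   # direction-aware model
gap_blind = surprise_gap(blind_denoiser, clip)            # direction-agnostic model
print(gap_ramp, gap_blind)
```

In this toy setting the direction-aware denoiser yields a gap close to 1, while the direction-agnostic one yields 0. A Level 2 style analysis would, under the same assumptions, compute such a gap separately on the VLM-labelled causal and non-causal subsets and compare the two.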
