WorldOlympiad:视频世界模型三项全能评测基准
阅读原文· arxiv.orgWorldOlympiad 将视频世界模型评估分解为物理、几何和交互三个维度。物理轨道用物体分割和 MLLM-as-judge 检验视频对力学、热现象、材料属性等规则的遵循;几何轨道以高斯泼溅重建评估结构一致性、跨视角连贯性与相机轨迹对齐;交互轨道评测模型能否按复杂动作提示生成连贯长程视频。基准覆盖游戏、机器人和通用真实视频三大场景。实验表明,当前最先进模型在物理推理、3D 一致性和长程交互上存在显著差距。
We introduce WorldOlympiad, a benchmark for diagnosing video-based world models across physical faithfulness, geometric consistency, and interaction fidelity. While existing benchmarks often focus on visual quality, semantic alignment, or short-term temporal coherence, they provide limited insight into whether generated videos obey physical rules, preserve coherent 3D structure, and sustain controllable interactions over long horizons. To address this gap, WorldOlympiad decomposes world-model evaluation into three complementary dimensions. The physical track uses object segmentation and MLLM-as-judge to assess whether generated videos follow interpretable rules in mechanics, thermal phenomena, and material properties. The geometry track reconstructs generated videos with Gaussian splatting and evaluates structural consistency, cross-view coherence, and camera-trajectory alignment. The interaction track assesses whether generated rollouts follow complex action prompts and maintain smooth, coherent transitions across consecutive video chunks. WorldOlympiad further covers three major downstream scenarios, including gaming, robotics, and general real-world videos, capturing diverse challenges from interactive control and embodied manipulation to open-domain motion and camera dynamics. Together, these tracks and scenarios form a scalable and interpretable evaluation suite that exposes failure modes beyond generic video quality. Experiments on state-of-the-art models reveal substantial gaps in physical reasoning, 3D consistency, and long-horizon interaction, underscoring the need for more structured evaluation protocols for generative world models.