美团LongCat发布视频世界模型评测基准WBench。该基准将测试重点从画面美观转向控制、多轮记忆、指令遵循和物理合理性等核心能力。它包含289个案例、1058个交互轮次,评估了20个模型在导航、主体动作、事件编辑等5个维度的表现,共使用22项自动指标。研究发现,没有任何模型能在所有维度上占据主导,这表明现有系统尚未将高质量渲染、可靠控制、长期记忆与物理规则遵循整合为稳定能力。WBench的设计能区分失败是源于渲染、场景设置、控制还是物理问题,并指出导航能力与视觉质量基本无关。
Most video models look better than they understand and Video quality is only the easiest thing to notice.
LongCat just released WBench, it turned video world model testing from a beauty contest into a stress test for control, multi-turn memory, instruction-following, and physical plausibility.
It exposed the gap between beautiful video generation and controllable world simulation.
A pretty clip is not enough, because a usable world model must keep the same scene, obey later actions, move the camera correctly, preserve objects, and avoid impossible cause-and-effect.
WBench tests this with 289 cases, 1,058 interaction turns, 20 models, 5 dimensions, and 22 automatic metrics, covering navigation, subject actions, event edits, perspective switches, and both viewpoints.