Rohan Paul@rohanpaul_ai

2026-06-02 17:16

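AI Summary

Meituan's LongCat has released WBench, an evaluation benchmark for video world models. The benchmark shifts the focus of testing from visual appeal to core capabilities such as control, multi-turn memory, instruction following, and physical plausibility. It comprises 289 cases and 1,058 interaction turns, and evaluates 20 models across 5 dimensions, including navigation, subject actions, and event edits, using 22 automatic metrics in total. The study finds that no model dominates every dimension, which indicates that current systems have not yet integrated high-quality rendering, reliable control, long-term memory, and adherence to physical rules into one stable capability. WBench's design can distinguish whether a failure stems from rendering, scene setup, control, or physics, and the study notes that navigation ability is essentially unrelated to visual quality.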

Most video models look better than they understand. Video quality is only the easiest thing to notice.

LongCat just released WBench, turning video world model testing from a beauty contest into a stress test for control, multi-turn memory, instruction-following, and physical plausibility.

It exposed the gap between beautiful video generation and controllable world simulation.

A pretty clip is not enough, because a usable world model must keep the same scene, obey later actions, move the camera correctly, preserve objects, and avoid impossible cause-and-effect.

WBench tests this with 289 cases, 1,058 interaction turns, 20 models, 5 dimensions, and 22 automatic metrics, covering navigation, subject actions, event edits, perspective switches, and both viewpoints.

Across all 20 evaluated models, the paper finds that none dominates every dimension, which means current systems have not yet merged high-quality rendering, reliable control, long-horizon memory, and physical rule-following into one stable capability.

Its design separates the world setup from the user action, so researchers can identify whether a failure comes from weak rendering, poor scene setup, bad control, lost state, or broken physics.
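To make the setup/action split concrete, here is a minimal Python sketch of how a case might be stored so that failures can be counted by source. This is not WBench's actual schema: the class names (`Case`, `Turn`), the field names, and the `FAILURE_SOURCES` labels are all assumptions made up for illustration, loosely following the five failure sources named above.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical labels mirroring the failure sources named in the post.
FAILURE_SOURCES = ("rendering", "scene_setup", "control", "state", "physics")


@dataclass
class Turn:
    """One interaction turn: the user action, kept apart from the world setup."""
    action: str
    failure: Optional[str] = None  # None means the turn passed

    def __post_init__(self):
        if self.failure is not None and self.failure not in FAILURE_SOURCES:
            raise ValueError(f"unknown failure source: {self.failure}")


@dataclass
class Case:
    """A test case: one fixed world setup plus a sequence of user actions."""
    world_setup: str
    turns: List[Turn] = field(default_factory=list)


def failure_counts(cases):
    """Count failed turns per failure source across all cases."""
    return Counter(
        turn.failure
        for case in cases
        for turn in case.turns
        if turn.failure is not None
    )


# Invented example data, not taken from the benchmark.
cases = [
    Case(
        world_setup="a kitchen with a red mug on the table",
        turns=[
            Turn("walk forward"),
            Turn("turn left", failure="control"),
            Turn("pick up the mug", failure="state"),
        ],
    ),
    Case(
        world_setup="a ball resting on a ramp",
        turns=[Turn("release the ball", failure="physics")],
    ),
]

counts = failure_counts(cases)
```

Because each turn carries only the action and the setup lives on the case, a failed turn can be attributed to one source without re-describing the scene.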

Navigation has near-zero connection with visual quality, consistency, or physics, meaning a model can look strong while still failing to move on command.
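One common way to quantify a "near-zero connection" between two sets of scores is the Pearson correlation coefficient. The sketch below assumes that reading; the post does not say which statistic the paper uses. The per-model scores are invented for illustration and are not WBench results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Invented scores for four hypothetical models (NOT benchmark data).
visual_quality = [1, 2, 3, 4]
navigation = [3, 1, 1, 3]

r = pearson(visual_quality, navigation)
```

With these made-up numbers the best-looking and worst-looking models navigate equally well, so the coefficient comes out at zero: ranking models by visual quality tells you nothing about how they rank on navigation.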

The key shift: stop asking only "does the video look good?" and start asking "can the model keep a controllable world alive across many turns?"

🧵 1.

Multimodal

Rohan Paul@rohanpaul_ai · X
