# New benchmark confirms AI video generators look stunning but still can't reason about the world

- Source: The Decoder: AI News (RSS)
- Author: Jonathan Kemper
- Published: 2026-05-16 18:55
- AIHOT score: 44
- AIHOT link: https://aihot.virxact.com/items/cmp89ffev0gt8slnzstuji07f
- Original article: https://the-decoder.com/new-benchmark-confirms-ai-video-generators-look-stunning-but-still-cant-reason-about-the-world

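## AI Summary

The new WorldReasonBench benchmark evaluates AI video generators on physical and logical plausibility rather than image quality. ByteDance's Seedance 2.0 leads the benchmark, ahead of Veo 3.1 and Sora 2. Commercial models score roughly twice as high as open-source models, and logical reasoning is the hardest category for every model, with a pronounced performance gap. The results suggest that AI video generators can produce stunning visuals but have not yet made the leap from pixel generator to real-world model.

## Body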

Modern video generators like Sora 2, Seedance 2.0, and Veo 3.1 produce increasingly impressive clips. But a new benchmark from Tsinghua University confirms what keeps coming up: visual quality and actual world understanding are two different things.

Instead of focusing on image quality, WorldReasonBench tests whether a model can take a starting scene and continue it in a way that makes sense: physically, socially, logically, and informationally.

Consider a basic test case: give a generator an image of an apple on a branch and tell it to drop the apple. The result might look great—smooth motion, realistic textures, nice lighting—and still get the physics fundamentally wrong. The apple might fly upward, pop like a balloon, or fall in a straight line instead of curving. Standard quality metrics would still reward that video for its realism. That's the gap WorldReasonBench is designed to catch.

WorldReasonBench includes about 400 test cases across four areas: world knowledge (physics, weather, cultural norms), human-centered scenes (object handling, social interaction), logical reasoning (math, geometry, science experiments), and information-based reasoning (reading data and diagrams).

Scoring works in two stages. First, a process-aware method uses structured questions to check whether the video reaches the right end state in a plausible way. Then a second pass rates reasoning quality, temporal consistency, and visual aesthetics. Alongside the benchmark, the team also released WorldRewardBench, a dataset of about 6,000 video comparisons ranked by trained annotators.
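The two-stage scoring can be pictured with a minimal Python sketch. Everything in it is an illustrative assumption rather than the benchmark's actual code: the `Check` class, the question wording, the hard-coded verdicts, and the plain averaging are hypothetical. The article only states that structured questions test whether the end state is reached plausibly, and that a second pass rates reasoning quality, temporal consistency, and visual aesthetics.

```python
from dataclasses import dataclass


@dataclass
class Check:
    """One structured yes/no question about a generated video (hypothetical)."""
    question: str
    phase: str    # "process" (how the scene unfolds) or "end_state" (final result)
    passed: bool  # verdict from some judge; hard-coded here for illustration


def stage_one_score(checks):
    """Fraction of structured checks passed; 0.0 if there are none."""
    if not checks:
        return 0.0
    return sum(c.passed for c in checks) / len(checks)


def stage_two_score(ratings):
    """Unweighted mean of second-pass ratings, assumed to lie in [0, 1]."""
    return sum(ratings.values()) / len(ratings)


# Toy verdicts for the apple example: the clip looks great but the apple
# does not move downward after leaving the branch.
apple_checks = [
    Check("Does the apple detach from the branch?", "process", True),
    Check("Does the apple move downward after detaching?", "process", False),
    Check("Is the apple on the ground at the end?", "end_state", True),
    Check("Is the apple intact at the end?", "end_state", True),
]
apple_ratings = {
    "reasoning_quality": 0.5,
    "temporal_consistency": 0.75,
    "visual_aesthetics": 1.0,
}

s1 = stage_one_score(apple_checks)
s2 = stage_two_score(apple_ratings)
print(f"stage 1: {s1:.2f}, stage 2: {s2:.2f}")
```

In this toy case the clip earns a perfect aesthetics rating, yet the failed process check still pulls its first-stage score down.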

### Commercial models lead by a wide margin, but logic trips up everyone

The researchers tested five commercial systems (Sora 2, Kling, Wan 2.6, Seedance 2.0, Veo 3.1-Fast) and six open-source models (LTX 2.3, Wan 2.2-14B, UniVideo, HunyuanVideo 1.5, Cosmos-Predict 2.5, LongCat-Video). Commercial generators scored roughly double what open-source models managed on the core reasoning metric, with no statistical overlap between the two groups.

ByteDance's Seedance 2.0 came out on top, finishing first in nearly nine out of ten statistical re-runs. Veo 3.1-Fast did best on world knowledge, and Sora 2 led on human-centered scenes. Seedance 2.0 also beat Veo 3.1-Fast, Kling, and Wan 2.6 in human ratings.
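One common way to produce a "first in X out of ten re-runs" figure is bootstrap resampling over test cases. The article does not say which procedure the researchers used, so the sketch below is an assumption, and the function name, model labels, and scores are all made up for illustration.

```python
import random


def first_place_rate(scores, n_resamples=1000, seed=0):
    """Estimate how often each model ranks first under bootstrap resampling.

    scores maps a model name to its per-case scores, with every model
    scored on the same cases in the same order.
    """
    rng = random.Random(seed)
    models = list(scores)
    n_cases = len(scores[models[0]])
    wins = dict.fromkeys(models, 0)
    for _ in range(n_resamples):
        # Draw test cases with replacement, then re-total every model.
        idx = [rng.randrange(n_cases) for _ in range(n_cases)]
        totals = {m: sum(scores[m][i] for i in idx) for m in models}
        # On an exact tie, max() keeps the model listed first.
        wins[max(models, key=totals.get)] += 1
    return {m: wins[m] / n_resamples for m in models}


# Hypothetical per-case scores for three unnamed models.
toy_scores = {
    "model_a": [0.9, 0.4, 0.8, 0.7, 0.6],
    "model_b": [0.5, 0.9, 0.6, 0.8, 0.7],
    "model_c": [0.2, 0.1, 0.3, 0.2, 0.1],
}
rates = first_place_rate(toy_scores)
print(rates)
```

A model that is beaten on every single case, like `model_c` here, can never finish first in any resample, while two closely matched models split the wins between them.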

More important than the rankings is a shared weakness: logical reasoning is the hardest category for every model tested. Even the best commercial systems drop well below their overall averages here, and most open-source models fail it almost entirely. Information-based reasoning is the second-toughest area, particularly when tasks require physically grounded transitions or exact preservation of text and numbers.

The study also introduces a metric that tracks how many correct answers come from dynamic, process-based phases rather than static snapshots. Commercial models score much higher here, which points to where open-source models really fall short: not in how things look, but in understanding cause and effect.
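A rough sketch of such a metric, under the assumption that each judged answer is tagged with the phase it covers. The exact definition used in the study is not given in the article; the function name, the `"process"`/`"static"` labels, and the data below are hypothetical.

```python
def process_share(answers):
    """Share of correct answers that come from process-based checks.

    answers is a list of (phase, correct) tuples, where phase is either
    "process" or "static". Returns 0.0 when nothing was answered correctly.
    """
    correct_phases = [phase for phase, correct in answers if correct]
    if not correct_phases:
        return 0.0
    return correct_phases.count("process") / len(correct_phases)


# Model A gets both kinds of checks right; model B mostly succeeds on
# static snapshots and fails two of three process checks.
model_a_answers = [("process", True)] * 3 + [("static", True)] * 3
model_b_answers = [
    ("process", True), ("process", False), ("process", False),
    ("static", True), ("static", True), ("static", True),
]

print(process_share(model_a_answers), process_share(model_b_answers))
```

Both toy models pass every static check, so a snapshot-only score would not separate them; the process share does.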

When models get more detailed prompts that spell out what should happen step by step, open-source generators improve the most. They're simply more dependent on prompt quality than their commercial rivals, which may itself be a side effect of the commercial models' stronger reasoning ability.

### Automated scoring lines up with human judgment

To validate their approach, the team compared their metrics against rankings from human video comparisons. The core metric tracks closely with human judgment and clearly outperforms traditional AI judges that compare videos in pairs.
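One simple way to check an automated metric against human comparisons is pairwise agreement: for each pair that annotators ranked, see whether the metric prefers the same video. This is an assumed procedure for illustration, not necessarily the validation the team ran, and the video ids and scores are invented.

```python
def pairwise_agreement(metric_scores, human_pairs):
    """Fraction of human-ranked pairs where the metric agrees.

    metric_scores maps a video id to its automated score.
    human_pairs is a list of (preferred_id, other_id) tuples.
    """
    agreements = sum(
        metric_scores[preferred] > metric_scores[other]
        for preferred, other in human_pairs
    )
    return agreements / len(human_pairs)


# Hypothetical automated scores and human preferences.
metric_scores = {"v1": 0.9, "v2": 0.4, "v3": 0.6}
human_pairs = [("v1", "v2"), ("v3", "v2"), ("v1", "v3"), ("v2", "v3")]

agreement = pairwise_agreement(metric_scores, human_pairs)
print(f"agreement with human preferences: {agreement:.2f}")
```

The metric disagrees with annotators only on the last pair, where humans preferred `v2` even though it scored lower than `v3`.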

The conclusion fits a growing body of evidence: despite real progress in resolution, length, and controllability, the jump from pixel generator to reliable world model hasn't happened. Getting there will likely depend less on visual polish and more on a better grasp of causal mechanisms and the ability to keep information consistent over time. The benchmark, data, and code are available on GitHub.

An international team of researchers recently reached a similar conclusion: Sora 2 and Veo 3.1 fall well short of human performance on reasoning tasks. Whether video generators even qualify as "world models" remains a contested question in AI research. Meta's Yann LeCun considers systems like Sora a dead end, while DeepMind CEO Demis Hassabis sees Google's Veo as a step toward a world model. OpenAI shut down Sora as a commercial video generator but kept the team intact to focus on world model research. A proposed definition called OpenWorldLib explicitly rules out pure text-to-video models from the category.

