New benchmark confirms AI video generators look stunning but still can't reason about the world

2026-05-16 18:55 · Jonathan Kemper

AI summary

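The new WorldReasonBench benchmark evaluates AI video generators on physical and logical plausibility rather than image quality. ByteDance's Seedance 2.0 leads the test, ahead of Veo 3.1 and Sora 2. Commercial models score roughly twice as high as open-source models, and logical reasoning is the hardest category for every model, with a pronounced performance gap. The results suggest that AI video generators can produce stunning visuals but have not yet made the leap from pixel generators to real world models.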

Modern video generators like Sora 2, Seedance 2.0, and Veo 3.1 produce increasingly impressive clips. But a new benchmark from Tsinghua University confirms what keeps coming up: visual quality and actual world understanding are two different things.

Instead of focusing on image quality, WorldReasonBench tests whether a model can take a starting scene and continue it in a way that makes sense: physically, socially, logically, and informationally.

Consider a basic test case: give a generator an image of an apple on a branch and tell it to drop the apple. The result might look great—smooth motion, realistic textures, nice lighting—and still get the physics fundamentally wrong. The apple might fly upward, pop like a balloon, or fall in a straight line instead of curving. Standard quality metrics would still reward that video for its realism. That's the gap WorldReasonBench is designed to catch.
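
To make that gap concrete, here is a minimal sketch of a physics check that a pure image-quality metric would never perform. It is an illustration only, not how WorldReasonBench scores videos (the benchmark uses structured questions). It assumes a hypothetical object tracker has already converted the apple's position into heights in metres at a fixed frame interval; the function name, the tolerance, and the input format are all assumptions.

```python
def looks_like_free_fall(heights, dt, g=9.81, tolerance=0.25):
    """Return True if sampled heights (metres) roughly match a drop under gravity.

    heights   -- hypothetical tracker output, one height per frame
    dt        -- time between frames in seconds
    tolerance -- allowed relative deviation from g (an arbitrary choice)
    """
    if len(heights) < 3:
        raise ValueError("need at least three samples")
    # The second finite difference approximates vertical acceleration.
    accelerations = [
        (heights[i + 1] - 2 * heights[i] + heights[i - 1]) / dt**2
        for i in range(1, len(heights) - 1)
    ]
    # Every estimate should be close to -g (accelerating downward).
    return all(abs(a + g) <= tolerance * g for a in accelerations)


# Synthetic trajectories sampled every 0.1 s from a height of 2 m.
falling = [2.0 - 0.5 * 9.81 * (i * 0.1) ** 2 for i in range(6)]
rising = [2.0 + 0.5 * 9.81 * (i * 0.1) ** 2 for i in range(6)]
```

A clip in which the apple accelerates upward would fail this check no matter how realistic its textures and lighting look.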

WorldReasonBench includes about 400 test cases across four areas: world knowledge (physics, weather, cultural norms), human-centered scenes (object handling, social interaction), logical reasoning (math, geometry, science experiments), and information-based reasoning (reading data and diagrams).

Scoring works in two stages. First, a process-aware method uses structured questions to check whether the video reaches the right end state in a plausible way. Then a second pass rates reasoning quality, temporal consistency, and visual aesthetics. Alongside the benchmark, the team also released WorldRewardBench, a dataset of about 6,000 video comparisons ranked by trained annotators.
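
The two-stage structure can be sketched roughly as follows. This is a hedged illustration, not the researchers' implementation: the article does not describe the question format, the rating scale, or how results are aggregated, so the `VideoJudgement` class, the 1-to-5 scale, and the simple averages below are all assumptions.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class VideoJudgement:
    # Stage 1 (assumed format): yes/no answers to structured questions
    # about the process and the end state of the generated video.
    question_answers: list[bool]
    # Stage 2 (assumed 1-5 scale): the three rated dimensions.
    reasoning_quality: float
    temporal_consistency: float
    visual_aesthetics: float


def process_score(judgement: VideoJudgement) -> float:
    """Fraction of structured questions answered 'yes' (assumed aggregation)."""
    return sum(judgement.question_answers) / len(judgement.question_answers)


def quality_score(judgement: VideoJudgement) -> float:
    """Plain average of the three stage-2 ratings (assumed aggregation)."""
    return mean([
        judgement.reasoning_quality,
        judgement.temporal_consistency,
        judgement.visual_aesthetics,
    ])


# Made-up example values, not benchmark data.
example = VideoJudgement(
    question_answers=[True, True, False, True],
    reasoning_quality=4.0,
    temporal_consistency=3.0,
    visual_aesthetics=5.0,
)
```

The sketch keeps the two stages as separate numbers so that a clip with high aesthetic ratings cannot hide a low process score.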

Commercial models lead by a wide margin, but logic trips up everyone

The researchers tested five commercial systems (Sora 2, Kling, Wan 2.6, Seedance 2.0, Veo 3.1-Fast) and six open-source models (LTX 2.3, Wan 2.2-14B, UniVideo, HunyuanVideo 1.5, Cosmos-Predict 2.5, LongCat-Video). Commercial generators scored roughly double what open-source models managed on the core reasoning metric, with no statistical overlap between the two groups.

Source: The Decoder: AI News (RSS), the-decoder.com
