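New benchmark confirms AI video generators look stunning but still can't reason about the world
Source: the-decoder.com
WorldReasonBench, a new benchmark, evaluates AI video generators on physical and logical plausibility rather than image quality. ByteDance's Seedance 2.0 leads the benchmark, ahead of Veo 3.1 and Sora 2. Commercial models score roughly twice as high as open-source models, and logical reasoning is the hardest category for every model, with a marked performance gap. The results suggest that AI video generators can produce stunning visuals but have not yet made the leap from pixel generators to true world models.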
Modern video generators like Sora 2, Seedance 2.0, and Veo 3.1 produce increasingly impressive clips. But a new benchmark from Tsinghua University confirms a recurring finding: visual quality and actual world understanding are two different things.
Instead of focusing on image quality, WorldReasonBench tests whether a model can take a starting scene and continue it in a way that makes sense: physically, socially, logically, and informationally.
Consider a basic test case: give a generator an image of an apple on a branch and tell it to drop the apple. The result might look great, with smooth motion, realistic textures, and nice lighting, and still get the physics fundamentally wrong. The apple might fly upward, pop like a balloon, or drift down at a constant speed instead of accelerating. Standard quality metrics would still reward that video for its realism. That's the gap WorldReasonBench is designed to catch.
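To make the apple example concrete, here is a minimal Python sketch of what a physics plausibility check could look like. It is not how WorldReasonBench scores videos; the function names, the tolerance, and the assumption that the apple's height has already been tracked frame by frame in metres are all hypothetical, chosen only for illustration. The idea is to estimate the apple's vertical acceleration from its per-frame heights and compare it with gravity.

```python
from statistics import mean

G = 9.81  # gravitational acceleration in m/s^2


def estimate_vertical_acceleration(heights_m, fps):
    """Estimate acceleration (m/s^2) as the mean second finite difference
    of per-frame heights. Assumes heights are already tracked in metres."""
    if len(heights_m) < 3:
        raise ValueError("need at least three frames")
    dt = 1.0 / fps
    second_diffs = [
        (heights_m[i + 1] - 2 * heights_m[i] + heights_m[i - 1]) / dt**2
        for i in range(1, len(heights_m) - 1)
    ]
    return mean(second_diffs)


def is_plausible_free_fall(heights_m, fps, rel_tol=0.2):
    """True if the estimated acceleration is within rel_tol of -G.
    The 20% tolerance is an arbitrary illustrative choice."""
    accel = estimate_vertical_acceleration(heights_m, fps)
    return abs(accel + G) <= rel_tol * G


# Three synthetic 12-frame trajectories at 24 fps, starting 3 m up.
fps = 24
frames = range(12)
falling = [3.0 - 0.5 * G * (i / fps) ** 2 for i in frames]   # real free fall
rising = [3.0 + 0.5 * G * (i / fps) ** 2 for i in frames]    # apple flies upward
constant_speed = [3.0 - 2.0 * (i / fps) for i in frames]     # falls without accelerating

print(is_plausible_free_fall(falling, fps))
print(is_plausible_free_fall(rising, fps))
print(is_plausible_free_fall(constant_speed, fps))
```

Only the first trajectory passes the check, even though all three could be rendered with identical textures and lighting. A perceptual quality metric sees only the rendering, which is why it cannot separate them.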