Tailor-Bench:修剪视觉世界建模评估的长尾
阅读原文· arxiv.orgTailor-Bench评估视觉世界模型模拟非常规物理交互的能力,设计三种渐进难度场景:常规(常见工具-任务组合)、非常规(属性兼容替代品)、不可能(违反属性工具)。在统一协议下,预测生成与描述生成分别测试无引导推理与忠实实现。实验表明模型性能从常规到非常规再到不可能逐步退化,暴露物理建模的长尾差距。失败分析显示图像模型无法实现正确状态变化,视频模型还有时间不一致,说明模型依赖表面视觉模式而非内化物理原理。
Physical interactions follow a long-tailed distribution: a set of common and regular interactions dominates human experience and visual data, while a broad spectrum of rare and irregular interactions remains underrepresented. Although recent visual world models, including image and video generation models, achieve impressive realism on existing benchmarks, they primarily focus on simulating common physical interactions. This raises a central question: Do current visual world models internalize and generalize physical principles? In this work, we introduce Tailor-Bench, a benchmark that challenges world models to simulate irregular physical interactions. To enable systematic evaluation, we design three scenario modes that progressively challenge model reasoning: Regular scenarios reflect common tool-task pairs, Unconventional scenarios replace conventional tools with attribute-compatible substitutes to test affordance generalization, and Impossible scenarios introduce attribute-violating tools to probe constraint awareness. Additionally, we design two complementary settings under a unified evaluation protocol: predictive generation requires inferring outcomes without guidance, while descriptive generation specifies the target outcome for faithful realization. Our experimental results reveal a clear long-tail gap in physical world modeling: performance degrades from Regular to Unconventional and Impossible scenarios, indicating limited generalization beyond common interactions. Failure analysis further shows that models rely on superficial visual patterns: image models fail to realize correct state changes, while video models further suffer from temporal inconsistencies.