# WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-04 08:00
- AIHOT score: 57
- AIHOT link: https://aihot.virxact.com/items/cmq4m5sl902a8slotxszyg2ai
- Original link: https://arxiv.org/abs/2606.06538

## AI Summary

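WorldBench is a reasoning benchmark for evaluating Multimodal Large Language Models (MLLMs). Its authors build a taxonomy of thousands of visual concepts spanning multiple domains (e.g., living things), collect a broad set of images from search engines and existing datasets, and use structured trial-and-error to manually design challenging questions that frontier MLLMs struggle to answer. An evaluation of 15 MLLMs shows that the strongest model reaches only 64.0% accuracy and that some models perform only marginally above chance level, revealing weaknesses in the visual understanding of current models. The benchmark achieves higher visual diversity than any existing diverse benchmark.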

## Main Text

In real-world applications, models are expected to perform reliably across diverse settings. Yet many existing multimodal benchmarks expand task types without capturing the visual diversity needed to handle open-ended visual inputs. We present WorldBench, a challenging and visually diverse reasoning benchmark to evaluate Multimodal Large Language Models (MLLMs). We build a taxonomy of thousands of visual concepts across multiple domains (e.g., living things). Guided by this taxonomy, we curate a broad collection of images from search engines and existing datasets to comprehensively represent the visual world. Through structured trial-and-error, we manually design challenging questions that frontier MLLMs fail to answer. In quantitative and human evaluations, WorldBench achieves higher visual diversity than any existing diverse benchmark. Evaluating 15 MLLMs on WorldBench reveals weaknesses in visual understanding: even the strongest model reaches only 64.0% accuracy, while some models perform only marginally above chance level. We hope our work highlights the importance of visual diversity in building multimodal benchmarks.
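To make the "accuracy versus chance level" comparison concrete, here is a minimal sketch of how the two numbers could be computed. It assumes a multiple-choice format with a known number of options per question, which the abstract does not state. The `Item` class, its field names, and the toy data are hypothetical and are not taken from WorldBench.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One hypothetical benchmark question (format assumed, not from the paper)."""
    num_options: int  # number of answer choices offered
    answer: str       # gold answer label, e.g. "B"
    prediction: str   # answer label produced by the model


def accuracy(items):
    """Fraction of items whose prediction matches the gold answer."""
    return sum(i.prediction == i.answer for i in items) / len(items)


def chance_level(items):
    """Expected accuracy of uniform random guessing over each item's options."""
    return sum(1 / i.num_options for i in items) / len(items)


# Toy data for illustration only; these are not WorldBench results.
items = [
    Item(4, "A", "A"),
    Item(4, "C", "B"),
    Item(4, "D", "D"),
    Item(4, "B", "A"),
]
print(f"accuracy={accuracy(items):.3f} chance={chance_level(items):.3f}")
```

Under this assumption, a model that is "marginally above chance" would have an `accuracy` only slightly larger than `chance_level` on the same set of items.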