# Mind's Eye: A Benchmark of Visual Abstraction, Transformation, and Composition for Multimodal LLMs

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-04-17 08:00
- AIHOT link: https://aihot.virxact.com/items/cmo9ps1nz03vnsls2p8nm9qba
- Original link: https://arxiv.org/abs/2604.16054

## AI Summary

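The research team released the "Mind's Eye" benchmark, which covers eight visual cognition tasks and evaluates the fluid reasoning abilities of multimodal large language models under the "Abstraction-Relation-Transformation" (A-R-T) taxonomy. Results show that human participants reach 80% accuracy, while top models stay below 50%. Error analysis reveals clear deficiencies in visual attention allocation, internal perceptual manipulation, and abstraction of underlying concepts, indicating that the visuospatial reasoning of current multimodal large language models still lags significantly behind human level.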

## Main Text

Multimodal large language models (MLLMs) have achieved impressive progress on vision-language benchmarks, yet their capacity for visual cognitive and visuospatial reasoning remains less understood. We introduce "Mind's Eye", a multiple-choice benchmark of eight visuo-cognitive tasks inspired by classic human intelligence tests and organized under a novel "A-R-T" taxonomy: Abstraction, Relation, and Transformation. The tasks probe core processes of fluid intelligence such as pattern induction, analogical relation mapping, and mental transformation. We evaluate a diverse suite of closed-source and open-source MLLMs and compare their performance with that of human participants. Humans achieve 80% accuracy, while top-performing MLLMs remain below 50%. Error analysis reveals failures in (i) visual attention allocation, (ii) internal perceptual manipulation, and (iii) abstraction of underlying visual concepts. Our findings suggest that current MLLMs exhibit limited visuospatial reasoning capabilities compared with human participants, highlighting the need for more cognitively grounded evaluation frameworks.
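The abstract reports accuracy on a multiple-choice benchmark grouped under the A-R-T taxonomy. As an illustration only, the following minimal Python sketch shows how per-category and overall accuracy could be computed from answer records. The record layout `(category, predicted, correct)`, the function names, and the toy data are assumptions made for this example; they are not taken from the paper or its released code.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return {category: accuracy} for (category, predicted, correct) tuples."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, predicted, correct in records:
        totals[category] += 1
        if predicted == correct:
            hits[category] += 1
    return {category: hits[category] / totals[category] for category in totals}


def overall_accuracy(records):
    """Return the fraction of records whose predicted choice is correct."""
    records = list(records)
    if not records:
        return 0.0
    return sum(predicted == correct for _, predicted, correct in records) / len(records)


# Hypothetical toy records for illustration; NOT data from the paper.
toy_records = [
    ("Abstraction", "A", "A"),
    ("Abstraction", "B", "C"),
    ("Relation", "D", "D"),
    ("Transformation", "A", "B"),
]

per_category = accuracy_by_category(toy_records)
overall = overall_accuracy(toy_records)
```

Reporting accuracy per category alongside the overall figure makes it easier to see whether errors concentrate in one part of the taxonomy, which is the kind of breakdown an error analysis would rely on.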