# The Mirage of Visual Understanding in Current Frontier Models

- Source: Gary Marcus: The Road to AI We Can Trust (RSS)
- Author: Gary Marcus
- Published: 2026-03-29 22:32
- AIHOT tag: Featured
- AIHOT link: https://aihot.virxact.com/items/cmnxjn4xh003ssl9of059x3ap
- Original link: https://garymarcus.substack.com/p/the-mirage-of-visual-understanding

## Why it was featured

Exposes loopholes in multimodal benchmarks; medical AI applications need to guard against data leakage risks.

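## AI summary

Current frontier multimodal large models can reach the top rank on a standard chest X-ray question-answering benchmark without access to any images. This anomaly exposes a serious deficiency in the models' visual understanding and suggests that their performance may rest on data bias or textual cues rather than on genuine image analysis. The study reveals deep flaws in how vision-language models are currently evaluated and argues that so-called "visual understanding" may be an illusion with no real perception behind it.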

## Body

From a damning new Stanford paper on the illusion of visual understanding in LLMs:

“Frontier models readily generate detailed image descriptions and elaborate reasoning traces, including pathology-biased clinical findings, for images never provided, we term this phenomenon mirage reasoning. Second, without any image input, models also attain strikingly high scores across general and medical multimodal benchmarks, bringing into question their utility and design. In the most extreme case, our model achieved the top rank on a standard chest X-ray question-answering benchmark without access to any images.”
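To make the "no image input" check concrete, here is a minimal sketch of a blind baseline. It is not the paper's method (the paper evaluates frontier models, not a lookup table); it only illustrates the idea that a predictor which never sees an image can still score well when answers are predictable from the question text. The function names and the toy question-answer items are hypothetical and invented for illustration.

```python
from collections import Counter, defaultdict


def fit_text_only_prior(train_items):
    """Learn the most common answer per question string. Images are never used."""
    by_question = defaultdict(Counter)
    overall = Counter()
    for item in train_items:
        by_question[item["question"]][item["answer"]] += 1
        overall[item["answer"]] += 1
    # Fallback for unseen questions: the most frequent answer overall.
    fallback = overall.most_common(1)[0][0]
    table = {q: c.most_common(1)[0][0] for q, c in by_question.items()}
    return table, fallback


def blind_accuracy(test_items, table, fallback):
    """Accuracy of the text-only predictor on held-out items."""
    correct = sum(
        1 for item in test_items
        if table.get(item["question"], fallback) == item["answer"]
    )
    return correct / len(test_items)


# Toy, made-up items (hypothetical; not drawn from any real benchmark).
train = [
    {"question": "Is there a pleural effusion?", "answer": "no"},
    {"question": "Is there a pleural effusion?", "answer": "no"},
    {"question": "Is there a pleural effusion?", "answer": "yes"},
    {"question": "Is the heart enlarged?", "answer": "yes"},
    {"question": "Is the heart enlarged?", "answer": "yes"},
    {"question": "Is there a fracture?", "answer": "no"},
    {"question": "Is there a fracture?", "answer": "no"},
]
test = [
    {"question": "Is there a pleural effusion?", "answer": "no"},
    {"question": "Is the heart enlarged?", "answer": "yes"},
    {"question": "Is there a pneumothorax?", "answer": "no"},
    {"question": "Is there a pleural effusion?", "answer": "yes"},
]

table, fallback = fit_text_only_prior(train)
score = blind_accuracy(test, table, fallback)
print(f"blind accuracy: {score:.2f}")  # prints "blind accuracy: 0.75"
```

If a blind baseline like this lands well above chance on a benchmark, then a high score on that benchmark says little about whether a model actually looked at the image.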

AGI this stuff ain’t.

This study reinforces what Anh Totti Nguyen has been saying for a long time, in a series of underappreciated papers like “Vision Language Models are Blind” that I keep trying to draw attention to.

Also, re the very active discussion on AI and jobs: although some white-collar jobs (e.g., entry-level coder or market research assistant) may be in near-term jeopardy, many of those that require visual understanding (architect, cartographer, civil engineer, film editor, medical illustrator, urban planner, etc.) probably aren’t vulnerable until entirely new techniques are developed.

And humanoid home robots? Don’t make me laugh. If your humanoid robot can’t understand the visual world, it’s just a demo, and not something you can trust.
