# Look Light, Think Heavy: What Multimodal Chain-of-Thought Reasoning Can and Cannot Do

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-21 08:00
- AIHOT score: 61
- AIHOT link: https://aihot.virxact.com/items/cmqsxg0qd075yslfuzuiu5bb0
- Original paper: https://arxiv.org/abs/2606.22565

## AI Summary

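A systematic evaluation of 12 multimodal tasks (with 14 non-reasoning models and 8 reasoning models) finds: (1) CoT is not a free lunch: it lowers performance on perception tasks such as visual grounding and object counting, but helps on mathematical, scientific, and multi-image reasoning; (2) existing open-source multimodal reasoning models show only limited overall gains over their original models, possibly because they over-emphasize mathematical reasoning at the expense of other capabilities; (3) visual reasoning is the bottleneck: models exhibit a "Look Light, Think Heavy" pattern in which verbal reflection rises and falls while visual reflection steadily diminishes, so deep visual introspection is not sustained throughout the reasoning process.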

## Main Text

Chain-of-Thought (CoT) has become a standard method for improving reasoning capabilities in large language models (LLMs) by eliciting step-by-step thinking, but its effectiveness in multimodal tasks remains unclear. In this paper, we aim to systematically investigate the key question: What can multimodal Chain-of-Thought reasoning do, and where and why does it fall short? To this end, we evaluate 12 multimodal tasks across perception and reasoning categories using 14 non-reasoning models and 8 reasoning models. Our analysis reveals several important findings: (1) CoT is not a free lunch and should be used selectively depending on the specific requirements of each task. For perception tasks, CoT can lead to undesirable side effects, such as reduced performance in visual grounding and object counting. In contrast, it proves effective for reasoning tasks involving mathematical, scientific, and multi-image reasoning; (2) Compared to their original models, existing open-source multimodal reasoning models often yield only marginal overall improvements, possibly due to an overemphasis on mathematical reasoning at the expense of broader capabilities; (3) Visual reasoning remains a key bottleneck for current multimodal CoT, as models exhibit a "Look Light, Think Heavy" pattern where verbal reflection rises and falls during reasoning, whereas visual reflection consistently diminishes. These findings suggest that while multimodal CoT handles verbal reflection relatively well, it lacks the ability to maintain deep visual introspection throughout the reasoning process.
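Finding (1), that CoT should be applied selectively by task, can be illustrated with a small prompt router. The sketch below is not the paper's method: the task labels, the `TASK_KINDS` table, the two prompt suffixes, and the choice to answer directly for unknown tasks are all hypothetical, chosen only to mirror the perception/reasoning split named in the abstract.

```python
from enum import Enum


class TaskKind(Enum):
    PERCEPTION = "perception"
    REASONING = "reasoning"


# Hypothetical task labels, grouped the way the abstract describes them:
# CoT hurt visual grounding and object counting, and helped math,
# science, and multi-image reasoning.
TASK_KINDS = {
    "visual_grounding": TaskKind.PERCEPTION,
    "object_counting": TaskKind.PERCEPTION,
    "math": TaskKind.REASONING,
    "science": TaskKind.REASONING,
    "multi_image": TaskKind.REASONING,
}

# Illustrative prompt suffixes (assumptions, not taken from the paper).
COT_SUFFIX = "Let's think step by step, then give the final answer."
DIRECT_SUFFIX = "Answer directly and concisely."


def use_cot(task: str) -> bool:
    """Return True only for tasks in the reasoning group.

    Unknown tasks fall back to direct answering; that default is an
    assumption of this sketch, not a recommendation from the paper.
    """
    return TASK_KINDS.get(task) is TaskKind.REASONING


def build_prompt(task: str, question: str) -> str:
    """Append the CoT or direct-answer suffix depending on the task."""
    suffix = COT_SUFFIX if use_cot(task) else DIRECT_SUFFIX
    return f"{question}\n{suffix}"


if __name__ == "__main__":
    print(build_prompt("object_counting", "How many cups are on the table?"))
    print(build_prompt("math", "What is the area of the shaded region?"))
```

In practice the routing table would presumably be derived from per-task validation results rather than hard-coded, since the abstract reports trends across models rather than a fixed rule.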