# 感知的代价：在整体框架内实现可信的多模态推理

- 来源：HuggingFace Daily Papers（社区热门论文）
- 发布时间：2026-05-21 08:00
- AIHOT 分数：49
- AIHOT 链接：https://aihot.virxact.com/items/cmpkp4pvu08xwsl01ilh6izn1
- 原文链接：https://arxiv.org/abs/2604.20665

## AI 摘要

当前视觉语言模型常出现“功能性失明”，即利用强大的语言先验绕过视觉表征瓶颈，而非真正融合多模态信息。本研究挑战了依赖数据消融的传统评估方法，提出了信息论框架下的“模态翻译协议”来量化“感知的代价”。该方法定义了三个新指标（Toll, Curse, Fallacy）与语义充分性准则。研究还假设存在多模态缩放的“分歧定律”：随着语言模型推理能力增强，视觉知识瓶颈带来的性能惩罚可能不降反升。这为构建更可信的多模态推理系统提供了新的评估工具与设计思路。

## 正文

The rapid proliferation of Vision-Language Models (VLMs) is often framed as enabling unified multimodal knowledge discovery but rests on an under-examined assumption: that current VLMs faithfully synthesise multimodal data. We argue they often do not, and this gap reflects a trustworthiness problem in the dominant Vision Encoder-Projector-LLM paradigm. Rather than extracting grounded knowledge from visual inputs, state-of-the-art models frequently exhibit functional blindness, i.e., exploiting strong language priors to bypass severe visual representation bottlenecks. In this work, we challenge the conventional methodology of multimodal evaluation, which relies on data ablation or new dataset creation and therefore conflates dataset biases with architectural incapacity. We propose an information-theoretic departure: the Modality Translation Protocol, designed to quantify what we call the Expense of Seeing. By translating semantic payloads rather than ablating them, we formulate three novel metrics -- the Toll (ToS), Curse (CoS), and Fallacy (FoS) of Seeing -- culminating in the Semantic Sufficiency Criterion (SSC). Furthermore, we hypothesise a Divergence Law of Multimodal Scaling: as the underlying language engines scale to unprecedented reasoning capabilities, the penalty of the visual knowledge bottleneck may increase rather than diminish. We argue the community should move beyond "multimodal gain" as a primary evaluation target. By elevating the SSC from a passive diagnostic constraint to an active architectural blueprint, we provide a foundation for guiding the next generation of AI systems toward genuine multimodal reasoning.
