# DPVR-LF: Late Fusion Is Enough. Dual-Path Vision Token Routing Against Visual Saturation in Multimodal LLMs

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-08 08:00
- AIHOT score: 60
- AIHOT link: https://aihot.virxact.com/items/cmq7pt5at00jyslep0x85klvh
- Original link: https://arxiv.org/abs/2606.09131

## AI Summary

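A layer-wise analysis of LLaVA-1.5 shows that vision tokens saturate in the middle layers: text-to-image attention falls from 0.68 at layer 0 to 0.07 at layer 4 and settles near 0.04 after layer 18, while text tokens keep benefiting from deeper processing. In response, the authors propose DPVR-LF, a dual-path vision token routing framework. At the saturation point it routes vision tokens into a one-layer side branch, lets text tokens pass through 13 deep layers on their own, and fuses the two streams only at the final layer. With only about 3% trainable parameters, it stays competitive on standard benchmarks while substantially reducing visual computation. The results indicate that vision tokens do not need to traverse every deep language-model layer, and that a single late fusion layer is enough to maintain perceptual ability.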

## Body

Multimodal large language models (MLLMs) commonly inherit the deep, symmetric Transformer backbone designed for unimodal text modeling, and apply the same computation uniformly to image and language tokens. This design overlooks a key modality asymmetry: image and text tokens differ substantially in information density, redundancy, and required reasoning depth. Through a layer-wise analysis of LLaVA-1.5, we observe that vision tokens tend to saturate in the middle layers. Specifically, text-to-image attention decreases from 0.68 at layer 0 to 0.07 by layer 4, and stabilizes near 0.04 after layer 18, whereas text tokens continue to benefit from deep semantic processing. These findings suggest a mismatch between architectural symmetry and depth-asynchronous modality evolution, resulting in redundant visual computation and possible drift in perceptual representations during deep task-specific adaptation. Motivated by this, we propose Dual-Path Vision Token Routing (DPVR), a modality-asymmetric routing framework for efficient MLLMs. Its core instantiation, DPVR-LF (Late-Layer Fusion), routes vision tokens at the saturation point into a one-layer trainable side branch, runs a thirteen-layer text-only forward that skips image positions in the deep stack, and re-fuses the visual and textual streams only at the final layer. With approximately 3% trainable parameters, DPVR-LF preserves competitive multimodal performance on standard benchmarks while reducing visual computation in the deep Transformer stack. The results challenge the conventional assumption that vision tokens must traverse all deep language-model layers, and indicate that a single late fusion layer can be sufficient for maintaining strong perceptual competence in LLaVA-style MLLMs.
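
The routing described above can be pictured as a per-layer plan of which token positions each layer processes. The following is a minimal sketch, not the authors' implementation. It assumes a 32-layer backbone with the routing point at layer 18 (chosen here only because 18 shared layers, 13 text-only layers, and one fusion layer add up to 32) and 576 image tokens per prompt; these numbers, and the names `dpvr_lf_schedule` and `count_vision_visits`, are illustrative assumptions rather than values or APIs confirmed by the abstract.

```python
def dpvr_lf_schedule(is_image, num_layers=32, saturation_layer=18):
    """Build a hypothetical DPVR-LF style layer plan.

    is_image: list of bools, True where the sequence position holds a vision token.
    Returns (plan, side_branch), where plan[i] = (kind, positions) for backbone
    layer i and side_branch = (kind, positions) for the one-layer vision branch.
    """
    all_pos = list(range(len(is_image)))
    text_pos = [i for i, img in enumerate(is_image) if not img]
    image_pos = [i for i, img in enumerate(is_image) if img]

    plan = []
    for layer in range(num_layers):
        if layer < saturation_layer:
            # Early layers: both modalities are processed together.
            plan.append(("shared", all_pos))
        elif layer < num_layers - 1:
            # Deep stack: image positions are skipped, only text runs.
            plan.append(("text_only", text_pos))
        else:
            # Final layer: visual and textual streams are re-fused.
            plan.append(("fusion", all_pos))

    # Vision tokens leave the backbone at the routing point and pass
    # through a single side-branch layer before the final fusion.
    side_branch = ("side_branch", image_pos)
    return plan, side_branch


def count_vision_visits(plan, side_branch, is_image):
    """Count (vision token, layer) pairs, a crude proxy for visual compute."""
    image_set = {i for i, img in enumerate(is_image) if img}
    visits = sum(len(image_set.intersection(pos)) for _, pos in plan)
    return visits + len(side_branch[1])


if __name__ == "__main__":
    # Assumed prompt layout: 576 image tokens followed by 40 text tokens.
    is_image = [True] * 576 + [False] * 40
    plan, side = dpvr_lf_schedule(is_image)
    routed = count_vision_visits(plan, side, is_image)
    symmetric = 576 * len(plan)  # every vision token visits every layer
    print(routed, symmetric, routed / symmetric)
```

Counting token-layer visits ignores that attention cost also depends on sequence length, so the ratio it prints should be read as a rough illustration of where vision tokens are skipped, not as a measured speedup.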
