# Drop-Then-Recovery: How Redundant Are Vision-Language-Action Models?

- Source: HuggingFace Daily Papers (trending community papers)
- Published: 2026-06-26 08:00
- AIHOT score: 56
- AIHOT link: https://aihot.virxact.com/items/cmr13lxuv01aisldxkq603ezv
- Original link: https://arxiv.org/abs/2606.27755

## AI Summary

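Vision-Language-Action (VLA) models inherit oversized language backbones from pretrained VLMs, which raises the question of how much of that capacity is redundant. The Drop-Then-Recovery (DTR) protocol removes Transformer blocks and then fine-tunes the model to recover, using GateProbe, a one-shot virtual-gate sensitivity metric, to assess which capacity is necessary. On LIBERO, removing half of the LLM blocks raises OpenVLA-OFT from 95.0% to 98.3% under the same fine-tuning budget, and keeping only two language blocks still recovers baseline performance; the vision and action pathways, however, tolerate removal far less well. The results suggest that current VLA benchmarks put limited pressure on deep language understanding, and that future architectures should balance capacity more evenly across language, vision, and action. The code is open source.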

## Main Text

Vision-Language-Action (VLA) models enable instruction-driven robotic manipulation, but they inherit oversized language backbones from pretrained VLMs whose capacity far exceeds what is needed for short robotic instructions. This raises a basic question: how much of a VLA model is actually necessary for closed-loop control? In this work, we study architectural redundancy in VLA models by using transformer block removal as a controlled intervention. We introduce Drop-Then-Recovery (DTR), an analysis protocol that removes selected blocks from a pretrained VLA model and then fine-tunes the resulting model to measure whether the removed capacity was necessary for downstream control. To make this intervention reliable, we propose GateProbe, a one-shot virtual-gate sensitivity metric that ranks blocks by their contribution to the downstream action loss. Across multiple VLA architectures, manipulation benchmarks and even real-robot industrial scenarios, we find a strong asymmetry in post-removal recoverability: *language backbones are highly redundant for standard robotic manipulation tasks, whereas vision and action pathways are substantially less tolerant to removal*. On LIBERO, removing half of the LLM blocks even improves OpenVLA-OFT from 95.0% to 98.3% under the same downstream fine-tuning budget, and retaining only two language blocks still recovers baseline-level performance. These results suggest that current VLA benchmarks may exert limited pressure on deep language grounding and compositional instruction understanding, and that future VLA architectures should allocate capacity more deliberately across language, vision, and action components. The code is available at https://github.com/s1ghhh/VLADrop.
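
The abstract does not define GateProbe precisely, so the following is only a toy sketch of the general idea of a virtual-gate sensitivity score, not the authors' implementation. It assumes each residual block's output is multiplied by a gate fixed at 1 and scores a block by the magnitude of the loss gradient with respect to that gate. For self-containment the gradient is approximated with central finite differences on a scalar toy model, rather than a single backward pass on a real VLA; all names (`forward`, `gate_sensitivity`, `select_blocks_to_keep`) are hypothetical. The recovery fine-tuning step of DTR is omitted.

```python
from typing import List, Sequence


def forward(x: float, weights: Sequence[float], gates: Sequence[float]) -> float:
    """Toy residual stack (assumption): each block computes x <- x + gate * w * x."""
    for w, g in zip(weights, gates):
        x = x + g * w * x
    return x


def loss(weights: Sequence[float], gates: Sequence[float], x: float, target: float) -> float:
    """Squared error standing in for the downstream action loss."""
    return (forward(x, weights, gates) - target) ** 2


def gate_sensitivity(weights: Sequence[float], x: float, target: float,
                     eps: float = 1e-4) -> List[float]:
    """Approximate |dL/dg_i| at g_i = 1 for every block via central differences."""
    n = len(weights)
    scores = []
    for i in range(n):
        up = [1.0] * n
        down = [1.0] * n
        up[i] += eps
        down[i] -= eps
        grad = (loss(weights, up, x, target) - loss(weights, down, x, target)) / (2 * eps)
        scores.append(abs(grad))
    return scores


def select_blocks_to_keep(scores: Sequence[float], keep: int) -> List[int]:
    """Return indices of the `keep` most sensitive blocks, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Usage: four toy blocks, two of which barely change the signal.
toy_weights = [0.5, 0.001, -0.3, 0.0002]
toy_scores = gate_sensitivity(toy_weights, x=1.0, target=2.0)
kept = select_blocks_to_keep(toy_scores, keep=2)
pruned_weights = [toy_weights[i] for i in kept]  # the "drop" step; fine-tuning would follow
```

In this toy setting the two near-identity blocks receive the lowest scores and are the ones dropped. In a real model, the gate gradients for all blocks would more plausibly come from one backward pass over a calibration batch, which is presumably what "one-shot" refers to.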
