# APT: Improving the Language-Instruction Generalization of Vision-Language-Action Policies via Action Expert Pretraining

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-10 08:00
- AIHOT score: 37
- AIHOT link: https://aihot.virxact.com/items/cmqf8mry903o3slwa31zjhi91
- Original link: https://arxiv.org/abs/2606.12366

## AI Summary

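Vision-Language-Action (VLA) models combine a pretrained VLM with a continuous action expert, but they generalize poorly to out-of-distribution language instructions. Two causes are identified: the training data has low language diversity, and the action expert is randomly initialized, so its noisy gradients degrade the VLM. APT takes a Bayesian view, factorizing the policy into a language-agnostic Vision-Action (VA) prior and a language-conditioned VLA likelihood, and trains in two stages. Stage 1 freezes the VLM and pretrains the action expert on vision-action pairs as the VA prior. Stage 2 injects language tokens through gated fusion while preserving the learned visuomotor prior. APT applies to π and GR00T-style architectures and achieves consistent gains on unseen instructions and compositional tasks.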
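
One plausible way to write the factorization described above is Bayes' rule applied to the action distribution. The symbols here are assumptions introduced for illustration ($a$ for the action, $o$ for the visual observation, $\ell$ for the language instruction), and the paper's own notation and exact form may differ:

```latex
\pi(a \mid o, \ell)
  \;=\; \frac{p(\ell \mid o, a)\, p(a \mid o)}{p(\ell \mid o)}
  \;\propto\;
  \underbrace{p(\ell \mid o, a)}_{\text{language-conditioned likelihood}}
  \;\cdot\;
  \underbrace{p(a \mid o)}_{\text{language-agnostic VA prior}}
```

Under this reading, the prior $p(a \mid o)$ involves no language at all, which would explain why it can be learned from vision-action pairs alone before any instruction is introduced.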

## Main Text

Vision-Language-Action (VLA) models that couple pretrained Vision-Language Models (VLMs) with continuous action experts have achieved strong manipulation performance, yet generalization to out-of-distribution (OOD) language instructions remains poor. A known challenge is the structural imbalance in VLA data, where language is far less diverse than visual and action content, making policies prone to visual shortcuts. While discrete-action methods mitigate this through vision-language co-training, continuous action experts lack such protection: they start from random initialization and learn entirely from imbalanced data, producing noisy gradients that corrupt the VLM and fail to exploit its language capability. We address this from a Bayesian perspective, factorizing the policy into a language-agnostic Vision-Action (VA) prior and a language-conditioned VLA likelihood, and propose APT, a two-stage training method emphasizing Action expert PreTraining. In Stage 1, the action expert is pretrained as a VA prior on vision-action pairs from a frozen VLM, bypassing the language imbalance. In Stage 2, language tokens are injected through a gated fusion mechanism that integrates VLM features while preserving the learned visuomotor prior. APT applies to mainstream VLA architectures, including the π and GR00T-style architectures. Comprehensive experiments validate that APT achieves consistent gains on unseen instructions and compositional tasks. Project Page: https://xukechun.github.io/papers/APT/
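
The abstract does not say how the gate in Stage 2 is built, so the following is only a minimal sketch of one common gating pattern, not APT's confirmed design. It assumes a scalar `tanh` gate initialized to zero, so that at the start of Stage 2 the fused output equals the Stage 1 visuomotor features and language is blended in only as the gate opens. The class name `GatedFusion`, the parameter `alpha`, and the toy feature vectors are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class GatedFusion:
    """Hypothetical gate: out = va + tanh(alpha) * lang, element-wise."""

    # Assumed initialisation: alpha = 0.0 keeps the gate closed at the start.
    alpha: float = 0.0

    def __call__(self, va_feat: list[float], lang_feat: list[float]) -> list[float]:
        if len(va_feat) != len(lang_feat):
            raise ValueError("feature vectors must have the same length")
        gate = math.tanh(self.alpha)  # lies in (-1, 1); exactly 0 when alpha == 0
        return [v + gate * l for v, l in zip(va_feat, lang_feat)]


# Toy stand-ins, not real model activations.
va_feat = [0.5, -1.0, 2.0]    # features from the pretrained action expert
lang_feat = [1.0, 1.0, 1.0]   # projected language features from the VLM

closed = GatedFusion(alpha=0.0)(va_feat, lang_feat)  # gate closed
opened = GatedFusion(alpha=1.0)(va_feat, lang_feat)  # gate partly open
```

With the gate closed, `closed` is identical to `va_feat`, which is the property that "preserving the learned visuomotor prior" suggests. In a real model `alpha` would be a learned parameter, and the fusion would act on token embeddings rather than plain lists.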