# AffordanceVLA: A Vision-Language-Action Model That Enhances Action Generation Through Affordance Understanding

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-04 08:00
- AIHOT score: 56
- AIHOT link: https://aihot.virxact.com/items/cmq13ksup0c4wsltrdw4vn4jh
- Original link: https://arxiv.org/abs/2606.06155

## AI Summary

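AffordanceVLA is a vision-language-action model that introduces structured affordance forecasting as a task-oriented intermediate representation to establish a more precise perception-action mapping. The model comprises three complementary components: Which2Act (object-centric grounding via visual latent prediction to suppress distractions), Where2Act (2D interaction localization via affordance map estimation), and How2Act (3D geometric reasoning to guide the manipulation policy). It uses a Mixture-of-Transformer architecture combined with a three-stage training strategy and a progressive data curriculum, and is accompanied by an automated data augmentation pipeline. In simulation and real-world experiments, the model achieves strong performance across diverse manipulation scenarios.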

## Main Text

Vision-Language-Action (VLA) models leverage the rich world knowledge of pretrained vision-language models (VLMs) to enable instruction-following robotic manipulation. However, the structural mismatch between VLM semantic spaces and embodied control policies often hinders the learning of precise perception-action mappings. To address this challenge, we propose AffordanceVLA, a unified framework that introduces structured affordance forecasting as a task-oriented intermediate representation to establish a more precise and robust perception-action mapping. Specifically, we progressively model manipulation priors through three complementary components: 1) Which2Act for object-centric grounding via visual latent prediction to suppress distractions; 2) Where2Act for 2D interaction localization via affordance map estimation; and 3) How2Act for 3D geometric reasoning to guide manipulation policies. These affordance cues provide spatially grounded, semantically conditioned, and action-coupled intermediate representations, thereby naturally bridging vision, language, and action. We integrate these modules into a Mixture-of-Transformer (MoT) architecture with specialized experts and train the model using a three-stage training strategy with a progressive data curriculum. To overcome the scarcity of dense affordance labels in robotic datasets, we also develop a robust automated data augmentation pipeline. Extensive experiments in simulation and in the real world demonstrate that AffordanceVLA achieves strong performance across diverse manipulation scenarios.
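
The abstract does not say how an affordance map becomes a quantity a policy can use, so the following is only an illustrative sketch and not the authors' implementation. It assumes a 2D affordance map is a grid of per-pixel scores, takes the highest-scoring pixel as the interaction location, and lifts that pixel to a 3D camera-frame point with the standard pinhole model. The function names `peak_pixel` and `back_project`, the toy heatmap, the depth value, and the intrinsics `fx`, `fy`, `cx`, `cy` are all hypothetical.

```python
from typing import List, Tuple


def peak_pixel(affordance_map: List[List[float]]) -> Tuple[int, int]:
    """Return (row, col) of the highest-scoring cell in a 2D score grid.

    Hypothetical stand-in for reading an interaction location off an
    affordance map; the paper's actual decoding step is not described.
    """
    best = (0, 0)
    best_score = float("-inf")
    for r, row in enumerate(affordance_map):
        for c, score in enumerate(row):
            if score > best_score:
                best_score = score
                best = (r, c)
    return best


def back_project(
    row: int, col: int, depth: float,
    fx: float, fy: float, cx: float, cy: float,
) -> Tuple[float, float, float]:
    """Lift a pixel with known depth to a camera-frame 3D point.

    Standard pinhole back-projection: u = col, v = row.
    The intrinsics here are assumed values, not taken from the paper.
    """
    x = (col - cx) * depth / fx
    y = (row - cy) * depth / fy
    return (x, y, depth)


# Toy 3x3 affordance map with its peak at the centre pixel.
heatmap = [
    [0.0, 0.1, 0.0],
    [0.2, 0.9, 0.3],
    [0.0, 0.1, 0.0],
]

row, col = peak_pixel(heatmap)
point = back_project(row, col, depth=0.5, fx=100.0, fy=100.0, cx=1.0, cy=1.0)
print((row, col), point)
```

A real system would more likely predict a soft distribution over pixels and reason over full 3D geometry rather than a single argmax point; the sketch only makes the 2D-to-3D relationship concrete.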