# LaWAM: A Latent World Action Model for Efficient, Dynamics-Aware Robot Policies

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-14 20:06
- AIHOT score: 49
- AIHOT link: https://aihot.virxact.com/items/cmqgqoetm025kslic5e775rzt
- Original paper: https://arxiv.org/abs/2606.15768

## AI Summary

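LaWAM is a latent world action model. It trains a latent action model in the feature space of a pretrained vision foundation model and uses that model's forward decoder to predict future observation features, which brings predictive dynamics into the robot policy without computationally expensive future video reconstruction. LaWAM reaches a 98.6% success rate on LIBERO and 91.22% on RoboTwin, and is competitive on real-world manipulation tasks. Inference takes 187 ms per action-chunk prediction, up to 24x lower wall-clock latency than pixel-space WAMs.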

## Main Text

Vision-Language-Action models (VLAs) leverage large-scale vision-language pretraining for semantic robot control, but often lack explicit foresight into how robot actions change the scene. World-Action Models (WAMs) address this limitation by conditioning policies on predicted futures, yet existing approaches typically rely on computationally expensive video generation with substantial pixel-level redundancy. We present LaWAM, a Latent World Action Model that exposes predictive dynamics to robot policies through compact latent visual subgoals instead of reconstructed future video. At the core of LaWAM is a latent-action-conditioned Latent World Model (LaWM). We obtain LaWM by training a latent action model in the latent space of a pretrained vision foundation model and repurposing its forward decoder to predict future observation features for scene evolution. LaWAM then conditions action generation on these predicted latent visual subgoals to enable dynamics-aware robot control. LaWAM achieves state-of-the-art or competitive success rates (SRs) across LIBERO (98.6% SR), RoboTwin (91.22% SR), and real-world manipulation tasks while retaining low-latency inference. LaWAM runs in 187 ms per action-chunk prediction and achieves up to 24x lower wall-clock latency than pixel-space WAMs.
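The data flow described above (encode the observation, predict a latent visual subgoal, then condition action generation on it) can be sketched in plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: `ToyLinear`, `ToyLaWAM`, and all dimensions are hypothetical, random linear maps stand in for the trained networks, and the language instruction a VLA would also receive is omitted. The abstract does not say how the latent action is obtained at inference time, so the sketch assumes a small head (`latent_action_head`) that proposes it from the current features.

```python
import random
from typing import List

Vector = List[float]

# All sizes below are arbitrary toy values, not taken from the paper.
OBS_DIM = 12           # flattened toy "image"
FEAT_DIM = 8           # feature size of the stand-in vision encoder
LATENT_ACTION_DIM = 4  # size of the latent action
ACTION_DIM = 7         # e.g. a 7-DoF arm command (assumed)
CHUNK_LEN = 5          # actions per predicted chunk (assumed)


class ToyLinear:
    """Random linear map standing in for a trained network."""

    def __init__(self, in_dim: int, out_dim: int, rng: random.Random) -> None:
        self.in_dim = in_dim
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                        for _ in range(out_dim)]

    def __call__(self, x: Vector) -> Vector:
        if len(x) != self.in_dim:
            raise ValueError(f"expected {self.in_dim} inputs, got {len(x)}")
        return [sum(w * v for w, v in zip(row, x)) for row in self.weights]


class ToyLaWAM:
    """Hypothetical skeleton of a latent world action model."""

    def __init__(self, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Stand-in for the frozen, pretrained vision foundation model.
        self.encoder = ToyLinear(OBS_DIM, FEAT_DIM, rng)
        # Assumption: a head that proposes a latent action from current features.
        self.latent_action_head = ToyLinear(FEAT_DIM, LATENT_ACTION_DIM, rng)
        # Forward decoder of the latent action model, reused as the world model.
        self.forward_decoder = ToyLinear(FEAT_DIM + LATENT_ACTION_DIM, FEAT_DIM, rng)
        # Action head conditioned on current features and the predicted subgoal.
        self.action_head = ToyLinear(2 * FEAT_DIM, ACTION_DIM * CHUNK_LEN, rng)

    def predict_subgoal(self, z_t: Vector) -> Vector:
        """Predict future observation features; no pixels are generated."""
        latent_action = self.latent_action_head(z_t)
        # List "+" concatenates the features with the latent action.
        return self.forward_decoder(z_t + latent_action)

    def act(self, observation: Vector) -> List[Vector]:
        """Return one action chunk conditioned on the latent visual subgoal."""
        z_t = self.encoder(observation)
        z_goal = self.predict_subgoal(z_t)
        flat = self.action_head(z_t + z_goal)
        # Split the flat output into CHUNK_LEN actions of ACTION_DIM values.
        return [flat[i * ACTION_DIM:(i + 1) * ACTION_DIM] for i in range(CHUNK_LEN)]


policy = ToyLaWAM(seed=0)
chunk = policy.act([0.5] * OBS_DIM)
print(len(chunk), len(chunk[0]))  # 5 7
```

The subgoal here is a single `FEAT_DIM` vector rather than a stack of future frames, which is the property the abstract credits for the latency advantage over pixel-space WAMs. A real system would replace every `ToyLinear` with a trained network.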
