# A Survey of World Action Models

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-18 08:00
- AIHOT score: 37
- AIHOT link: https://aihot.virxact.com/items/cmqq2kjqi060kslp5umnu6vyg
- Original paper: https://arxiv.org/abs/2606.20781

## AI Summary

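World Action Models (WAMs) are embodied predictive-action models that predict the future either by repurposing large video generation models or by relying on language or vision-language backbones. The survey clarifies the boundaries between WAMs and related concepts such as video generation models, action-grounded video world models, and Vision-Language-Action policies. It organizes existing methods through two views: what is generated (rendered futures, latent futures, or video-generation-free action reasoning) and design dimensions (predictive substrate, backbone, action coupling, and deployment regime). The analysis shows that WAMs are not simply video generators with action heads; their designs trade representational richness against compute, memory, latency, and action-label cost. The field is moving toward generating less of the future while preserving the information that control requires.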

## Body

World Action Models (WAMs) are embodied predictive-action models that make a forecast of the future available to action. Recent WAMs repurpose large video generation models, and a parallel line relies on language or vision-language backbones without a video-generation core. This rapid expansion has blurred the boundary among broad world models, video generation models, action-grounded video world models, Vision-Language-Action policies, and WAMs. This survey gives the field a common account. It first clarifies these boundaries, then organizes existing works through two complementary views. The first view asks what each method is required to generate, spanning rendered futures, latent futures, and video-generation-free action reasoning. The second view decomposes each method by predictive substrate, backbone, action coupling, and deployment regime. This anatomy supports a unified discussion of interactability, causality, persistence, physical plausibility, and generalization, followed by data, evaluation, and open challenges. Across these axes, a consistent design pattern emerges: WAMs are not simply video generators with action heads, but predictive-action methods whose design choices trade representational richness against compute, memory, latency, and action-label cost. The field is moving toward methods that generate less of the future while preserving what control requires. The survey homepage is available at https://world-action-models.github.io/.
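
The two organizing views in the abstract can be read as a small record type: one field for what a method generates, and four fields for the design axes. The sketch below is only an illustration of that reading, not the survey's own schema. The class names, the `group_by_target` helper, and the three example entries are hypothetical placeholders, and the axis values are free-text stand-ins rather than categories taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class GenerationTarget(Enum):
    """First view: what a method is required to generate."""
    RENDERED_FUTURE = "rendered future"
    LATENT_FUTURE = "latent future"
    VIDEO_FREE_ACTION_REASONING = "video-generation-free action reasoning"


@dataclass(frozen=True)
class WAMDescriptor:
    """Second view: the four design axes, stored as free text (assumed format)."""
    name: str
    target: GenerationTarget
    predictive_substrate: str
    backbone: str
    action_coupling: str
    deployment_regime: str


def group_by_target(methods):
    """Bucket method names by generation target (the first view)."""
    groups = {target: [] for target in GenerationTarget}
    for method in methods:
        groups[method.target].append(method.name)
    return groups


# Hypothetical placeholder entries; none of these is a real method from the survey.
EXAMPLES = [
    WAMDescriptor("method-a", GenerationTarget.RENDERED_FUTURE,
                  "placeholder substrate", "video generation model",
                  "placeholder coupling", "placeholder regime"),
    WAMDescriptor("method-b", GenerationTarget.LATENT_FUTURE,
                  "placeholder substrate", "video generation model",
                  "placeholder coupling", "placeholder regime"),
    WAMDescriptor("method-c", GenerationTarget.VIDEO_FREE_ACTION_REASONING,
                  "placeholder substrate", "vision-language backbone",
                  "placeholder coupling", "placeholder regime"),
]

if __name__ == "__main__":
    for target, names in group_by_target(EXAMPLES).items():
        print(f"{target.value}: {names}")
```

Keeping the generation target as an enum and the design axes as plain strings reflects that the abstract enumerates the three generation targets explicitly but names the four axes without listing their possible values.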
