# ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-17 08:00
- AIHOT score: 44
- AIHOT link: https://aihot.virxact.com/items/cmqkbvrcp04u9slhil6ztjh3c
- Original link: https://arxiv.org/abs/2606.19531

## AI Summary

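ImageWAM repurposes pretrained image editing models for robot action prediction, with no video generation required. At inference time it uses the KV caches produced by image-editing denoising as the world-action context and does not decode the target frame. In simulator and real-world experiments, ImageWAM outperforms standard VLA baselines and matches competitive WAMs, while reducing compute (FLOPs) to 1/6 and latency to 1/4 of video-based WAMs. Attention analysis shows that the editing caches focus on task-relevant regions, supporting image editing as an effective alternative to video generation.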

## Full Text

World Action Models (WAMs) commonly rely on video generation to bridge visual world modeling and robot control. However, video-based WAMs face three coupled limitations: dense multi-frame future tokens make inference costly, full video prediction spends capacity on action-irrelevant temporal and appearance details, and long-horizon future imagination may introduce errors that mislead action prediction. These issues raise a simple question: do world action models really need video generation? We propose ImageWAM, a simple WAM framework that repurposes pretrained image editing models for robot action prediction. In contrast to video generation, image editing provides a better-matched prior: it only needs to model a target-frame transformation, focuses on action-relevant current-to-target visual differences, and grounds task instructions to localized visual changes through edit pretraining. In practice, ImageWAM does not decode the target frame at inference time; instead, it conditions a flow-matching action expert on the KV caches produced by image-editing denoising, using them as a compact world-action context. Across simulator and real-world experiments, ImageWAM outperforms standard VLA baselines and matches competitive WAMs without additional policy pretraining. It also reduces FLOPs to 1/6 and latency to 1/4 of those of video-based WAMs. Attention analysis further shows that editing caches focus on task-relevant change regions, supporting image editing as an effective alternative to video-based world-action modeling.
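To make the inference path described above more concrete, here is a minimal sketch of flow-matching action sampling conditioned on cached keys and values. It is an illustration under strong assumptions, not the paper's implementation: the function names (`attend`, `toy_velocity`, `sample_action`), the single fixed query, and the closed-form velocity are all hypothetical stand-ins. A real action expert would be a trained network attending to the editing model's actual KV caches at every layer and step.

```python
import math
import random


def attend(query, keys, values):
    """Single-head scaled dot-product attention of one query over cached K/V."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def toy_velocity(action, t, context):
    """Hypothetical stand-in for a trained action expert (assumption).

    Returns the straight-line velocity that carries `action` to `context`
    at t = 1. A real expert would be a learned network.
    """
    return [(c - a) / (1.0 - t) for a, c in zip(action, context)]


def sample_action(keys, values, query, steps=10, seed=0):
    """Euler-integrate the velocity field from noise (t=0) to an action (t=1)."""
    rng = random.Random(seed)
    # The cache is only read through attention; no target frame is decoded.
    context = attend(query, keys, values)
    action = [rng.gauss(0.0, 1.0) for _ in context]
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt  # t stays below 1, so toy_velocity never divides by zero
        velocity = toy_velocity(action, t, context)
        action = [a + dt * v for a, v in zip(action, velocity)]
    return action


if __name__ == "__main__":
    # Toy cache: two cached tokens with 2-d keys and 3-d values (made-up numbers).
    keys = [[1.0, 0.0], [0.0, 1.0]]
    values = [[0.5, -0.5, 0.0], [0.1, 0.2, 0.3]]
    print(sample_action(keys, values, query=[2.0, 0.0]))
```

In this toy setup the sampled action ends up at approximately the attended context vector, because the stand-in velocity points straight at it. The point of the sketch is the data flow: the cache is produced once, the action sampler reads it through attention, and no image or video is ever decoded.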
