# AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-08 08:00
- AIHOT score: 68
- AIHOT link: https://aihot.virxact.com/items/cmq687vre062lsl5i35c054p6
- Original paper: https://arxiv.org/abs/2606.09811

## AI Summary

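The paper proposes AHA-WAM, built on a dual Diffusion Transformer (DiT) architecture. The video DiT serves as a low-frequency world planner that maintains a rolling key-value memory encoding long-horizon scene evolution; the high-frequency action DiT queries this context through layerwise joint attention and executes short action chunks in closed loop. Horizon-adaptive offset training and Observation-Guided Video-Context Routing (OVCR) let the action expert use long-horizon world context without rerunning the video DiT while staying responsive to the real-time state. AHA-WAM reaches 92.80% average success on the RoboTwin benchmark and 78.3% average success on 4 real-world tasks, with a closed-loop control frequency of 24.17 Hz (a 4.59x speedup over Fast-WAM), and requires no robot-data pretraining.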

## Main Text

World-action models have emerged as a promising paradigm for robot manipulation, jointly modeling visual scene dynamics and actions to inject physical priors into policy learning. However, existing world-action models couple world prediction and action execution at the same temporal resolution, forcing the world branch to model near-term frame variations that are redundant and weakly informative. We posit that strictly binding world prediction and action execution to the same temporal rhythm may underutilize the potential of the video branch for embodied control. Therefore, we propose AHA-WAM, an Asynchronous Horizon-Adaptive World-Action Model built on a dual Diffusion Transformer (DiT) architecture that reorganizes world-action modeling around this temporal asymmetry. AHA-WAM instantiates the video DiT as a low-frequency world planner that maintains rolling key-value memory over past observations and exposes reusable layerwise latent context encoding long-horizon scene evolution, while a high-frequency action DiT executes short action chunks in closed loop by querying this context through layerwise joint attention. To support asynchronous execution, we introduce horizon-adaptive offset training and Observation-Guided Video-Context Routing (OVCR), which together let the action expert exploit long-horizon world context while remaining responsive to real-time execution state without rerunning the video DiT. Experiments on RoboTwin and real-world manipulation tasks show that AHA-WAM achieves state-of-the-art performance without any robot-data pretraining, attaining 92.80% average success on RoboTwin and 78.3% success across 4 real-world tasks, while reaching 24.17 Hz closed-loop control with a 4.59x speedup over Fast-WAM.
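The asynchronous schedule described above can be illustrated with a minimal toy sketch: a slow planner refreshes a cached context only occasionally, while a fast action expert runs on every control cycle and reuses that cache. This is not the paper's implementation. `WorldPlanner`, `ActionExpert`, `run_episode`, `replan_every`, `chunk_len`, and `memory_size` are hypothetical names and values chosen for illustration, the "context" is a plain dictionary rather than layerwise DiT latents, and the elapsed-step `offset` is only an assumed stand-in for whatever signal horizon-adaptive offset training actually conditions on.

```python
from collections import deque


class WorldPlanner:
    """Toy stand-in for the low-frequency video DiT (hypothetical interface)."""

    def __init__(self, memory_size=4):
        # Rolling memory: the oldest entry is dropped once the deque is full.
        self.memory = deque(maxlen=memory_size)
        self.calls = 0

    def update(self, observation):
        """Absorb a new observation and return a reusable context object."""
        self.memory.append(observation)
        self.calls += 1
        # Placeholder for the cached latent context the action expert queries.
        return {"memory": tuple(self.memory)}


class ActionExpert:
    """Toy stand-in for the high-frequency action DiT (hypothetical interface)."""

    def __init__(self, chunk_len=4):
        self.chunk_len = chunk_len
        self.calls = 0

    def act(self, context, observation, offset):
        """Return a short action chunk from cached context and a fresh observation."""
        self.calls += 1
        # Each toy "action" records which observation and offset produced it.
        return [(observation, offset, k) for k in range(self.chunk_len)]


def run_episode(num_steps, replan_every, planner, expert):
    """Closed loop: replan rarely, act on every cycle using the cached context."""
    log = []
    context = None
    planned_step = 0
    step = 0
    while step < num_steps:
        observation = step  # toy observation: just the current step index
        if context is None or step - planned_step >= replan_every:
            context = planner.update(observation)  # slow path, run rarely
            planned_step = step
        offset = step - planned_step  # how stale the cached context is
        chunk = expert.act(context, observation, offset)  # fast path
        log.extend(chunk[: num_steps - step])
        step += len(chunk)
    return log
```

With `num_steps=32`, `replan_every=16`, and `chunk_len=4`, the planner runs twice while the action expert runs eight times, which is the kind of call-count asymmetry that allows the control loop to run faster than the world model.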
