ABot-M0.5: A Unified World Action Model for Mobility and Manipulation
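ABot-M0.5 is a new World Action Model (WAM) for robot mobile manipulation. It addresses the alignment problems of existing WAMs at three levels. Temporal granularity alignment introduces intermediate latent actions that capture local visual state transitions and bridge video latents and embodied control. Action space alignment uses a dual-level Mixture-of-Transformers architecture that decouples modality representations and heterogeneous action subspaces (such as base movement and arm manipulation). Inference condition alignment proposes the dream-forcing training strategy, which progressively trains inverse dynamics on model-predicted videos to improve alignment and robustness during autoregressive inference. On mobile and fine-grained manipulation benchmarks, ABot-M0.5 achieves state-of-the-art results in both long-horizon task success rate and fine-grained control accuracy.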
Mobile manipulation is a key capability for general-purpose robots, yet remains challenging for current embodied learning methods. VLA policies are typically reactive and lack explicit world modeling, while existing World Action Models (WAMs) are still poorly aligned with the structure of mobile manipulation: they operate on coarse video chunks, model entangled navigation-manipulation actions, and train inverse dynamics under supervision that does not match autoregressive inference. As a result, they often miss fine-grained contact dynamics, suffer from action-distribution conflicts, and accumulate errors over long-horizon rollouts. We propose ABot-M0.5, a new WAM built on the insight that mobile manipulation requires alignment at three levels: temporal granularity, action space, and train-test consistency. To align temporal granularity, we introduce intermediate latent actions that capture local visual state transitions and serve as a bridging action space between video latents and embodiment-specific controls. To align action space, we design a dual-level Mixture-of-Transformers architecture that disentangles both modality representations and heterogeneous action subspaces such as base movement and arm manipulation. To align inference conditions, we propose the dream-forcing training strategy, which progressively trains inverse dynamics on model-predicted videos, improving train-test alignment and robustness during autoregressive prediction. Experiments on challenging mobile and fine-grained manipulation benchmarks demonstrate that ABot-M0.5 achieves state-of-the-art performance in both long-horizon task success and fine-grained control accuracy. These results highlight the critical importance of granularity-aligned, action-disentangled, and inference-consistent world-action modeling.
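The abstract says only that dream-forcing "progressively" trains inverse dynamics on model-predicted videos; it does not give the schedule. The sketch below illustrates one way such a progression could look, under stated assumptions: the linear ramp, the per-sample random choice, and the names `dream_probability` and `pick_idm_input` are all hypothetical and are not taken from the paper.

```python
import random


def dream_probability(step: int, total_steps: int, max_prob: float = 1.0) -> float:
    """Hypothetical linear ramp (an assumption, not the paper's schedule).

    Returns the probability that a training sample feeds the inverse-dynamics
    model a model-predicted ("dreamed") video instead of a ground-truth one.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so steps outside the schedule stay well defined.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return max_prob * progress


def pick_idm_input(ground_truth_video, predicted_video, step, total_steps, rng):
    """Choose which video the inverse-dynamics model sees at this step.

    Early in training the ground-truth video dominates; later the
    model-predicted video dominates, which is closer to what the
    inverse-dynamics model receives during autoregressive inference.
    """
    use_dream = rng.random() < dream_probability(step, total_steps)
    return predicted_video if use_dream else ground_truth_video


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 250, 500, 750, 1000):
        p = dream_probability(step, total_steps=1000)
        choice = pick_idm_input("ground_truth", "predicted", step, 1000, rng)
        print(f"step={step:4d}  p_dream={p:.2f}  input={choice}")
```

The design intent mirrors the train-test argument in the abstract: if inverse dynamics is supervised only on ground-truth frames, it never sees the imperfections of predicted frames that it must handle at inference time, so gradually shifting its inputs toward predicted videos narrows that gap.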