# RepWAM: World Action Modeling Based on Representation Visual-Action Tokenizers

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-11 08:00
- AIHOT score: 68
- AIHOT link: https://aihot.virxact.com/items/cmqaa49oc0jjzslldveyn0vul
- Original link: https://arxiv.org/abs/2606.13674

## AI Summary

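RepWAM is a representation-centric world action model (WAM) built on a representation visual-action tokenizer. Existing WAMs inherit reconstruction-oriented video tokenizers, but pixel reconstruction offers limited help for learning instruction-following dynamics. To address this, the authors train a representation visual-action tokenizer that maps visual inputs into aligned visual and latent action tokens, pretrain the WAM to jointly model future visual states and the latent actions connecting them, and then adapt it to real robot trajectories for closed-loop manipulation. Experiments show that RepWAM performs strongly across diverse manipulation settings, and ablations highlight the advantage of semantic visual-action tokenization. Code and weights will be open-sourced.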

## Full Text

This work presents RepWAM, a representation-centric world action model (WAM) built on representation visual-action tokenizers. Existing WAMs typically inherit reconstruction-oriented video tokenizers from pretrained video generation models. Although these tokenizers preserve visual fidelity, pixel reconstruction alone provides limited guidance for learning instruction-following dynamics that connect future prediction with robot control. To address this, we explore a semantic visual-action latent space for representation-centric world action modeling. Specifically, we train a representation visual-action tokenizer that maps visual inputs into aligned visual and latent action tokens. We then pretrain our WAM to jointly model future visual states and the latent actions that connect them under language instructions, followed by adaptation to real robot trajectories for closed-loop manipulation. Experiments on real-world manipulation tasks and simulation benchmarks show that RepWAM delivers strong performance across diverse manipulation settings, while ablations highlight the value of semantic visual-action tokenization over reconstruction-oriented alternatives. These results establish representation visual-action tokenization as a promising foundation for world action models and a step toward generalist robot policies. Code and weights will be available at https://github.com/wdrink/RepWAM.
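
The abstract does not describe the tokenizer's architecture, so the following is only a toy sketch of the data flow it outlines: frames become visual tokens, each pair of consecutive frames yields a latent action token aligned with that transition, and the two streams are interleaved into one sequence that a model could be trained on jointly. Everything here is an assumption made for illustration: the random linear `encoder`, the nearest-neighbour codebooks, the one-token-per-frame simplification, the feature-difference definition of a latent action, and the function names `quantize`, `tokenize`, and `interleave`. None of it is taken from the RepWAM code.

```python
import numpy as np


def quantize(features, codebook):
    """Return, for each feature row, the index of the nearest codebook row."""
    # features: (N, D), codebook: (K, D) -> squared distances: (N, K)
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def tokenize(frames, encoder, visual_codebook, action_codebook):
    """Map T frames to T visual tokens and T-1 latent action tokens.

    Hypothetical toy rule: the latent action between frames t and t+1 is the
    quantized difference of their features, so each action token is aligned
    with the pair of visual tokens it connects.
    """
    feats = frames.reshape(len(frames), -1) @ encoder  # (T, D)
    visual_tokens = quantize(feats, visual_codebook)  # (T,)
    action_tokens = quantize(feats[1:] - feats[:-1], action_codebook)  # (T-1,)
    return visual_tokens, action_tokens


def interleave(visual_tokens, action_tokens, action_offset):
    """Build the sequence v_0, a_0, v_1, a_1, ..., v_{T-1}.

    Action ids are shifted by action_offset so the two vocabularies do not clash.
    """
    seq = []
    for t, v in enumerate(visual_tokens):
        seq.append(int(v))
        if t < len(action_tokens):
            seq.append(int(action_tokens[t]) + action_offset)
    return seq


rng = np.random.default_rng(0)
frames = rng.random((4, 8, 8))  # 4 toy grayscale frames
encoder = rng.standard_normal((64, 16))  # random stand-in for a learned encoder
visual_codebook = rng.standard_normal((32, 16))
action_codebook = rng.standard_normal((8, 16))

visual_tokens, action_tokens = tokenize(
    frames, encoder, visual_codebook, action_codebook
)
sequence = interleave(
    visual_tokens, action_tokens, action_offset=len(visual_codebook)
)
```

In this sketch a sequence model trained with next-token prediction on `sequence` would have to predict both the next latent action and the next visual state, which is one simple way to read "jointly model future visual states and the latent actions that connect them". Language conditioning and the adaptation to real robot actions are left out.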