# Light-WAM: Efficient World Action Model with State-Fusion Action Decoding

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-06 08:00
- AIHOT score: 54
- AIHOT link: https://aihot.virxact.com/items/cmq6rifxv0b4osl5issyj1dtc
- Original link: https://arxiv.org/abs/2606.08242

## AI Summary

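Light-WAM is an efficient, lightweight World Action Model for robot manipulation. It uses a compact video backbone and performs future-video supervision in a downsampled latent space, which lowers the cost of video co-training. Action prediction is handled by the StateFusionActionExpert, which reads states from multiple backbone layers, fuses the features through learned-query pooling, and directly predicts action chunks in a single forward pass, avoiding heavy generative action experts. The model has only 0.44B trainable parameters, maintains strong performance on LIBERO, reaches usable multi-task performance on RoboTwin 2.0, runs at 72.03 ms inference latency with 4.1 GiB peak GPU memory, and improves training throughput.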

## Main Text

World Action Models (WAMs) extend robot policy learning by incorporating future prediction as an additional training objective, encouraging the policy to encode task-relevant temporal structure in its representations. Current WAMs often rely on large-scale generative architectures that incur high training costs and inference latency, making them difficult to deploy as efficient closed-loop policies. We propose Light-WAM, a lightweight World Action Model for efficient robot manipulation. Specifically, it is built with a compact video backbone and performs future-video supervision in a downsampled latent space, reducing the cost of video co-training while retaining its benefits for representation learning. For action prediction, Light-WAM introduces the StateFusionActionExpert, which reads adapted states from multiple backbone layers, fuses them through learned-query pooling, and directly predicts action chunks in a single forward pass. This design provides an efficient interface between video backbone representations and robot actions, avoiding the need for heavy generative action experts. Experiments demonstrate that Light-WAM maintains strong performance on LIBERO and achieves usable multi-task performance on RoboTwin 2.0, while using only 0.44B trainable parameters. It also achieves 72.03ms inference latency with 4.1GiB peak GPU memory and improved training throughput.
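The state-fusion step described above (adapt states from several backbone layers, pool them with learned queries, decode a whole action chunk in one pass) can be illustrated with a rough sketch. This is not the authors' implementation: the class name `StateFusionSketch`, the linear adapters, the single-head dot-product attention used for pooling, the linear output head, and every dimension below are assumptions chosen for illustration, and random weights stand in for trained ones. It is written in plain Python to stay dependency-free.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng, scale=0.1):
    """Random Gaussian matrix, standing in for trained weights or features."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

class StateFusionSketch:
    """Hypothetical stand-in for a state-fusion action head (not the paper's code)."""

    def __init__(self, layer_dims, d_model=8, num_queries=4,
                 chunk_len=5, action_dim=7, seed=0):
        rng = random.Random(seed)
        self.d_model = d_model
        self.chunk_len, self.action_dim = chunk_len, action_dim
        # Assumed: one linear adapter per tapped backbone layer (layer_dim -> d_model).
        self.adapters = [rand_matrix(d, d_model, rng) for d in layer_dims]
        # Learned queries that pool the fused token set.
        self.queries = rand_matrix(num_queries, d_model, rng)
        # Assumed: a linear head maps the flattened pooled vector to a whole chunk.
        self.head = rand_matrix(num_queries * d_model, chunk_len * action_dim, rng)

    def __call__(self, layer_states):
        # 1) Adapt each layer's tokens to a shared width and concatenate them.
        tokens = []
        for states, adapter in zip(layer_states, self.adapters):
            tokens.extend(matmul(states, adapter))
        # 2) Learned-query pooling: every query attends over all tokens.
        keys_t = [list(col) for col in zip(*tokens)]  # (d_model, num_tokens)
        scores = matmul(self.queries, keys_t)
        scale = math.sqrt(self.d_model)
        weights = [softmax([s / scale for s in row]) for row in scores]
        pooled = matmul(weights, tokens)  # (num_queries, d_model)
        # 3) Decode every action of the chunk at once, with no iterative sampling.
        flat = [[v for row in pooled for v in row]]
        out = matmul(flat, self.head)[0]
        return [out[i * self.action_dim:(i + 1) * self.action_dim]
                for i in range(self.chunk_len)]

# Usage: three tapped layers with different widths, three tokens per layer.
rng = random.Random(1)
layer_dims = [6, 10, 12]
layer_states = [rand_matrix(3, d, rng, scale=1.0) for d in layer_dims]
expert = StateFusionSketch(layer_dims)
chunk = expert(layer_states)
print(len(chunk), len(chunk[0]))  # 5 7
```

The point of the sketch is the cost profile: a deterministic head of this shape needs one pass over the backbone states per action chunk, whereas a generative action expert (for example a diffusion or flow-based decoder) typically runs several denoising or sampling steps. How the actual StateFusionActionExpert adapts, pools, and decodes is specified in the paper, not here.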
