Light-WAM: An Efficient World Action Model with State-Fusion Action Decoding

2026-06-06 08:00 · 27 days ago

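AI Summary

Light-WAM is an efficient, lightweight World Action Model for robot manipulation. It uses a compact video backbone and performs future-video supervision in a downsampled latent space, which reduces the cost of video co-training. Action prediction is handled by the StateFusionActionExpert, which reads states from multiple backbone layers, fuses the features through learned-query pooling, and directly predicts action chunks in a single forward pass, avoiding heavy generative action experts. The model has only 0.44B trainable parameters, maintains strong performance on LIBERO, reaches usable multi-task performance on RoboTwin 2.0, and achieves 72.03 ms inference latency with 4.1 GiB peak GPU memory, along with improved training throughput.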

Original text · untranslated

World Action Models (WAMs) extend robot policy learning by incorporating future prediction as an additional training objective, encouraging the policy to encode task-relevant temporal structure in its representations. Current WAMs often rely on large-scale generative architectures that incur high training costs and inference latency, making them difficult to deploy as efficient closed-loop policies. We propose Light-WAM, a lightweight World Action Model for efficient robot manipulation. Specifically, it is built with a compact video backbone and performs future-video supervision in a downsampled latent space, reducing the cost of video co-training while retaining its benefits for representation learning. For action prediction, Light-WAM introduces the StateFusionActionExpert, which reads adapted states from multiple backbone layers, fuses them through learned-query pooling, and directly predicts action chunks in a single forward pass. This design provides an efficient interface between video backbone representations and robot actions, avoiding the need for heavy generative action experts. Experiments demonstrate that Light-WAM maintains strong performance on LIBERO and achieves usable multi-task performance on RoboTwin 2.0, while using only 0.44B trainable parameters. It also achieves 72.03ms inference latency with 4.1GiB peak GPU memory and improved training throughput.
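The action-decoding path described above (per-layer adapted states, learned-query pooling, one-shot chunk prediction) can be illustrated with a small sketch. This is not the authors' implementation: the class name `StateFusionSketch`, the linear per-layer adapters, the single pooling query, the linear output head, and all dimensions are assumptions chosen for illustration, and the weights are random rather than trained. It uses only the Python standard library so the data flow is easy to follow.

```python
import math
import random


def matvec(matrix, vec):
    """Multiply a (rows x cols) list-of-lists matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix standing in for trained weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class StateFusionSketch:
    """Hypothetical stand-in for a state-fusion action head (assumed design)."""

    def __init__(self, num_layers, hidden_dim, fused_dim, chunk_len, action_dim, seed=0):
        rng = random.Random(seed)
        # One linear adapter per tapped backbone layer (assumption).
        self.adapters = [rand_matrix(fused_dim, hidden_dim, rng) for _ in range(num_layers)]
        # A single learned pooling query (assumption; several could be used).
        self.query = [rng.gauss(0.0, 0.1) for _ in range(fused_dim)]
        # Linear head that emits the whole action chunk at once (assumption).
        self.head = rand_matrix(chunk_len * action_dim, fused_dim, rng)
        self.chunk_len = chunk_len
        self.action_dim = action_dim

    def fuse(self, layer_states):
        """Adapt tokens from every layer, then attention-pool them with the query."""
        tokens = [
            matvec(adapter, tok)
            for adapter, states in zip(self.adapters, layer_states)
            for tok in states
        ]
        scale = math.sqrt(len(self.query))
        scores = [sum(q * t for q, t in zip(self.query, tok)) / scale for tok in tokens]
        weights = softmax(scores)
        fused = [
            sum(w * tok[d] for w, tok in zip(weights, tokens))
            for d in range(len(self.query))
        ]
        return fused, weights

    def forward(self, layer_states):
        """Predict chunk_len actions in one pass, with no iterative sampling."""
        fused, _ = self.fuse(layer_states)
        flat = matvec(self.head, fused)
        return [
            flat[i * self.action_dim:(i + 1) * self.action_dim]
            for i in range(self.chunk_len)
        ]


# Toy input: 3 tapped layers, 4 tokens per layer, hidden size 16 (all assumed).
rng = random.Random(1)
layer_states = [
    [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(4)]
    for _ in range(3)
]
expert = StateFusionSketch(num_layers=3, hidden_dim=16, fused_dim=8, chunk_len=5, action_dim=7)
chunk = expert.forward(layer_states)
print(len(chunk), len(chunk[0]))
```

The point of the sketch is the cost profile: the head performs one pooling step and one projection per control step, in contrast to a generative action expert that would need multiple denoising or sampling iterations to produce the same chunk.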

HuggingFace Daily Papers (community trending papers)

Read the original: arxiv.org