# WALL-WM: Carving World Action Modeling along Event Nodes

- Source: HuggingFace Daily Papers (trending community papers)
- Published: 2026-06-01 08:00
- AIHOT score: 64
- AIHOT link: https://aihot.virxact.com/items/cmpypojhi0265sli337wvxx6k
- Original link: https://arxiv.org/abs/2606.01955

## AI Summary

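WALL-WM is a World Action Model that shifts video-action learning from fixed-length chunk optimization to Vision-Language-Action (VLA) pretraining grounded in semantic events. It treats semantically coherent action events as the basic unit of learning, which resolves the mismatch in temporal granularity among language, vision, and action. WALL-WM builds its data ecosystem from event-level captions combined with cluster-balanced sampling, and supports two inference modes from the same pretrained backbone: event mode (variable-length execution chunks) and unified mode (a VLM with Staircase Decoding). Backed by large-scale pretraining infrastructure built on the Muon optimizer, WALL-WM achieves state-of-the-art performance in real-world generalization evaluations across language, scenes, and tasks.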

## Body

WALL-WM is a World Action Model (WAM) that shifts video-action learning from chunk-centric optimization to event-grounded Vision-Language-Action (VLA) pretraining, using semantically coherent action events as the atomic unit of learning. Existing WAMs commonly initialize from multimodal or video foundation models and then optimize fixed-length action chunks conditioned directly on the current observation and instruction. Although convenient, this chunk-centric formulation creates a fundamental granularity mismatch. Language describes semantic goals and events, vision evolves through continuous scene dynamics, and actions operate at control-level timescales; forcing all three into the same fixed-length prediction window turns VLA training into short-horizon correlation fitting. WALL-WM addresses this mismatch by organizing both supervision and data around semantic events. Specifically, it pairs event-grounded VLA pretraining with a data ecosystem built from event-level captions and cluster-balanced sampling, enabling scalable learning over diverse behaviors, scenes, and task structures. From the same event-pretrained backbone, WALL-WM supports two complementary inference modes. The event mode consumes next-event descriptions and enables variable-length execution chunks, while the unified mode uses a vision-language model (VLM) with Staircase Decoding to condition conventional fixed-length chunk inference while preserving a gradient-continuous VLA path. Together with Muon-optimizer-based large-scale pretraining infrastructure, WALL-WM provides a practical scale-up recipe for general-purpose WAMs. Experiments show that WALL-WM generalizes broadly across language, scenes, and tasks, achieving state-of-the-art performance in large-scale real-world generalization evaluation.
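To make the contrast between fixed-length chunks and event-grounded, variable-length chunks concrete, here is a minimal sketch. It is not the paper's implementation: the function names `fixed_chunks` and `event_chunks`, the toy trajectory, and the per-timestep event labels (`"reach"`, `"grasp"`, `"lift"`) are all hypothetical, and it assumes that each timestep already carries an event label, which the abstract does not describe how to obtain.

```python
from itertools import groupby


def fixed_chunks(actions, chunk_len):
    """Split a trajectory into fixed-length chunks (the last may be shorter)."""
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    return [actions[i:i + chunk_len] for i in range(0, len(actions), chunk_len)]


def event_chunks(actions, event_labels):
    """Group consecutive timesteps that share an event label.

    Returns a list of (label, actions_in_event) pairs, so chunk length
    follows the event boundaries instead of a fixed window.
    """
    if len(actions) != len(event_labels):
        raise ValueError("actions and event_labels must have equal length")
    chunks = []
    start = 0
    for label, run in groupby(event_labels):
        length = sum(1 for _ in run)  # number of consecutive steps in this event
        chunks.append((label, actions[start:start + length]))
        start += length
    return chunks


# Hypothetical toy trajectory: action "vectors" are just integer step ids.
actions = list(range(10))
event_labels = ["reach"] * 3 + ["grasp"] * 2 + ["lift"] * 5

fixed = fixed_chunks(actions, chunk_len=4)
events = event_chunks(actions, event_labels)

for label, chunk in events:
    print(label, len(chunk), chunk)
```

In this toy case the first fixed-length chunk of four steps spans the end of `"reach"` and the start of `"grasp"`, while the event-based grouping keeps each chunk inside a single semantic event. That is the kind of granularity mismatch the abstract describes, shown here only on hand-labeled toy data.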