Discrete-WAM:统一离散视觉-动作Token编辑用于世界-策略学习
阅读原文· arxiv.org自动驾驶需推理自车动作如何影响世界演化,现有端到端方法依赖直接状态-动作映射,缺乏对动作条件动力学的显式建模;连续潜空间世界模型缺乏组合因果推理。Discrete-WAM提出统一潜视觉-动作世界策略,将未来视觉状态与自车动作表示为对齐的离散token,在离散扩散框架内联合实现世界建模、世界-动作策略和层级决策策略,支持跨替代未来的组合因果推理与可控生成。在大规模自动驾驶基准上取得有竞争力的性能。
Autonomous driving requires reasoning about how ego actions shape the evolution of the surrounding world. However, most end-to-end methods rely on direct state-to-action mappings, capturing correlations without explicitly modeling action-conditioned dynamics. Conversely, continuous-latent world models often lack compositional structure for causal reasoning across counterfactual futures. We introduce Discrete-WAM, a unified latent vision-action world policy that represents future visual states and ego actions as aligned discrete tokens, enabling compositional causal reasoning across alternative futures. Built upon this unified discrete alignment, Discrete-WAM establishes a shared discrete diffusion framework with unified generative tasks, jointly formulating world modeling, world-action policy, and hierarchical decision-enabled policy, supporting compositional generalization across diverse driving scenarios. Experiments on large-scale autonomous-driving benchmarks show that Discrete-WAM achieves competitive performance while supporting controllable generation and counterfactual reasoning, offering a principled path toward more reliable decision-making.