# Flash-WAM: Modality-Aware Distillation for World-Action Models

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-03 08:00
- AIHOT score: 62
- AIHOT link: https://aihot.virxact.com/items/cmq0bomks04pdsltrpi83aopo
- Original paper: https://arxiv.org/abs/2606.05254

## AI Summary

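World-action models (WAMs) jointly generate future video and robot actions through iterative diffusion, but the cost of tens of denoising steps prevents real-time control. Flash-WAM proposes modality-aware step distillation: a linear-gradient-scaling parametrization for the action stream's low-noise regime and a variance-preserving parametrization for the video stream's high-noise regime, compressing inference to a single step. Instantiated on LingBot-VA, it reduces per-chunk latency on RoboTwin 2.0 from 8.1 seconds to 348 ms (NVIDIA L40S), a 23× speedup. Task success is preserved on simulation benchmarks (85.5% on RoboTwin 2.0, 95.7% on LIBERO), and a real-world Unitree G1 humanoid robot reaches a 60% average, whereas naive consistency distillation reaches only 24%.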

## Main Text

World-action models (WAMs) jointly generate future video and robot actions through iterative diffusion, achieving strong performance on manipulation benchmarks but requiring tens of denoising steps, a cost that precludes real-time control. Step distillation has emerged as the natural remedy, but off-the-shelf methods break down in the joint video-action setting because video and action streams use different SNR-shifted noise schedules and reach training with substantially different marginal noise distributions, an asymmetry that single-modality distillation methods cannot accommodate. We introduce Flash-WAM, a modality-aware step-distillation framework inspired by consistency distillation that selects the consistency function for each modality to match its noise regime: a linear-gradient-scaling parametrization for the action stream's low-noise regime, paired with a variance-preserving parametrization for the video stream's high-noise regime. This choice is grounded in a structural analysis of the consistency-function family that characterizes the achievable gradient scaling under the consistency boundary condition. Instantiated on LingBot-VA, Flash-WAM compresses inference to a single step in each modality. On RoboTwin 2.0, this reduces per-chunk latency from 8.1 seconds to 348 ms on an NVIDIA L40S, a 23× speedup that enables real-time inference. Flash-WAM preserves task success on simulation benchmarks (85.5% on RoboTwin 2.0, 95.7% on LIBERO) and substantially recovers real-world performance (60% average on a Unitree G1 humanoid robot), while naive consistency distillation drops to 24% at the same step budget.
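The consistency boundary condition mentioned above can be illustrated with a minimal Python sketch. It assumes the common form f(x, t) = c_skip(t)·x + c_out(t)·F(x, t), with t in [0, 1] and t = 0 as the clean end, so that f(x, 0) = x requires c_skip(0) = 1 and c_out(0) = 0. The names `linear_coeffs`, `vp_coeffs`, and `toy_network` are hypothetical, and the two coefficient pairs are generic textbook choices used only for illustration; they are not the parametrizations defined in the Flash-WAM paper.

```python
import math


def linear_coeffs(t):
    """Hypothetical linear pair: c_skip = 1 - t, c_out = t."""
    return 1.0 - t, t


def vp_coeffs(t):
    """Hypothetical variance-preserving pair: c_skip^2 + c_out^2 = 1."""
    angle = 0.5 * math.pi * t
    return math.cos(angle), math.sin(angle)


def consistency_fn(coeffs, network, x, t):
    """Evaluate f(x, t) = c_skip(t) * x + c_out(t) * network(x, t) elementwise."""
    c_skip, c_out = coeffs(t)
    pred = network(x, t)
    return [c_skip * xi + c_out * pi for xi, pi in zip(x, pred)]


def toy_network(x, t):
    """Stand-in for a learned denoiser (assumption); any function works here."""
    return [0.5 * xi - t for xi in x]


x = [0.3, -1.2, 2.0]

# Boundary condition: at t = 0 both parametrizations return the input unchanged,
# regardless of what the network predicts.
at_zero_linear = consistency_fn(linear_coeffs, toy_network, x, 0.0)
at_zero_vp = consistency_fn(vp_coeffs, toy_network, x, 0.0)
print(at_zero_linear == x, at_zero_vp == x)

# Away from t = 0 the two pairs weight the network output differently,
# which changes how strongly gradients flow through the network term.
for t in (0.1, 0.5, 0.9):
    print(t, linear_coeffs(t), vp_coeffs(t))
```

Both pairs satisfy the boundary condition, but they scale the network term differently across noise levels, which is the kind of degree of freedom a per-modality choice of consistency function can exploit.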