Qwen-RobotWorld Technical Report: An Embodied World Model Based on Language-Conditioned Video Generation
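Source: arxiv.org

Qwen-RobotWorld is a language-conditioned video world model that uses natural language as a unified action interface to predict physically feasible future visual trajectories from the current observation, covering robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. Its core design has three parts: a 60-layer Double-Stream MMDiT (double-stream diffusion transformer) that couples frozen Qwen2.5-VL semantics with video-VAE latent features; an Embodied World Knowledge corpus (8.6 million video-text pairs, over 200 million frames, covering more than 20 embodiments and more than 500 action types); and a General+Expert progressive training curriculum that first learns general visual priors and then injects embodiment-specific knowledge. The model ranks first overall on EWMBench and DreamGen Bench and outperforms all open-source models on WorldModelBench and PBench; a zero-shot analysis on RoboTwin-IF validates its generalization and multi-view consistency.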
We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation opens three promising application directions: synthetic data generation to augment policy training, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. The model rests on a three-part design: a) Double-Stream MMDiT with MLLM Action Encoding, where a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M video-text corpus (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive results show that the model is strongly competitive: it ranks 1st overall on EWMBench and DreamGen Bench and outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on the RoboTwin-IF benchmark further support robust generalization and multi-view consistency.
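To make "layer-wise joint attention" between the two streams more concrete, here is a minimal NumPy sketch of one double-stream block in the general MMDiT style. It is an illustration under stated assumptions, not the report's implementation: the class name `DoubleStreamBlock`, the single attention head, the tiny dimensions, and the random weights are all hypothetical, and normalization, MLP sublayers, and diffusion timestep conditioning are omitted. The idea it shows is that each modality keeps its own projection weights, while attention is computed once over the concatenated text and video tokens.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class DoubleStreamBlock:
    """Hypothetical single-head double-stream block (illustration only).

    Text and video tokens have separate QKV and output projections,
    but share one attention map over the concatenated sequence.
    """

    def __init__(self, dim, rng):
        scale = dim ** -0.5
        # Separate (assumed) weights per stream.
        self.txt_qkv = rng.normal(0.0, scale, (dim, 3 * dim))
        self.vid_qkv = rng.normal(0.0, scale, (dim, 3 * dim))
        self.txt_out = rng.normal(0.0, scale, (dim, dim))
        self.vid_out = rng.normal(0.0, scale, (dim, dim))

    def __call__(self, txt, vid):
        n_txt = txt.shape[0]
        # Per-stream projections to queries, keys, values.
        q_t, k_t, v_t = np.split(txt @ self.txt_qkv, 3, axis=-1)
        q_v, k_v, v_v = np.split(vid @ self.vid_qkv, 3, axis=-1)
        # Joint attention: concatenate both streams along the token axis.
        q = np.concatenate([q_t, q_v])
        k = np.concatenate([k_t, k_v])
        v = np.concatenate([v_t, v_v])
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        mixed = attn @ v
        # Split back and apply per-stream output projections with residuals.
        new_txt = txt + mixed[:n_txt] @ self.txt_out
        new_vid = vid + mixed[n_txt:] @ self.vid_out
        return new_txt, new_vid


rng = np.random.default_rng(0)
block = DoubleStreamBlock(dim=16, rng=rng)
txt_tokens = rng.normal(size=(4, 16))   # stand-in for language-model features
vid_tokens = rng.normal(size=(10, 16))  # stand-in for video-VAE latent tokens
new_txt, new_vid = block(txt_tokens, vid_tokens)
print(new_txt.shape, new_vid.shape)  # (4, 16) (10, 16)
```

Because keys and values from both streams enter the same softmax, changing the instruction tokens changes the updated video tokens, which is the mechanism by which language can steer the predicted frames. Stacking such blocks would give the layer-wise coupling the abstract describes; how the actual model handles the frozen text features across layers is not specified here.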