# Qwen-RobotWorld Technical Report: An Embodied World Model Based on Language-Conditioned Video Generation

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-15 08:00
- AIHOT score: 51
- AIHOT link: https://aihot.virxact.com/items/cmqg30by7032bslsp1hamy995
- Original link: https://arxiv.org/abs/2606.17030

## AI Summary

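Qwen-RobotWorld is a language-conditioned video world model. It uses natural language as a unified action interface and predicts physically feasible future visual trajectories from the current observation, covering robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. Its core design has three parts: a 60-layer double-stream Diffusion Transformer (Double-Stream MMDiT) that couples frozen Qwen2.5-VL semantics with video-VAE latent features; an Embodied World Knowledge corpus (8.6M video-text pairs, more than 200M frames, spanning 20+ embodiments and 500+ action categories); and a General+Expert progressive curriculum that first learns general visual priors and then injects embodiment-specific knowledge. The model ranks first overall on EWMBench and DreamGen Bench and outperforms all open-source models on WorldModelBench and PBench. Zero-shot analyses on RoboTwin-IF support its generalization and multi-view consistency.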

## Main Text

We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation opens three application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. The model rests on a three-part design: a) Double-Stream MMDiT with MLLM Action Encoding, where a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M video-text corpus (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive results show strong competitiveness: the model ranks first overall on EWMBench and DreamGen Bench and outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on the RoboTwin-IF benchmark further support robust generalization and multi-view consistency.
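The "layer-wise joint attention" in part a) can be illustrated with a minimal sketch of one double-stream attention step. It is not the authors' implementation: it assumes the general MMDiT pattern in which each stream keeps its own projection weights while attention runs over the concatenated token sequence. It uses a single head and omits normalization, residual connections, and timestep modulation. All names (`StreamProjections`, `joint_attention`, `random_matrix`) are hypothetical, and the token matrices stand in for Qwen2.5-VL features and video-VAE latents.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random matrix; a stand-in for learned weights or token features."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class StreamProjections:
    """Hypothetical per-stream Q/K/V/output weights (one set per stream)."""

    def __init__(self, dim, rng):
        self.wq = random_matrix(dim, dim, rng)
        self.wk = random_matrix(dim, dim, rng)
        self.wv = random_matrix(dim, dim, rng)
        self.wo = random_matrix(dim, dim, rng)


def joint_attention(text_tokens, video_tokens, text_proj, video_proj):
    """One joint-attention step over the concatenated text and video tokens."""
    dim = len(text_tokens[0])
    n_text = len(text_tokens)
    # Each stream is projected with its own weights, then sequences are concatenated.
    q = matmul(text_tokens, text_proj.wq) + matmul(video_tokens, video_proj.wq)
    k = matmul(text_tokens, text_proj.wk) + matmul(video_tokens, video_proj.wk)
    v = matmul(text_tokens, text_proj.wv) + matmul(video_tokens, video_proj.wv)
    # Scaled dot-product attention across all tokens of both streams.
    scale = 1.0 / math.sqrt(dim)
    scores = [[scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
              for q_row in q]
    weights = [softmax(row) for row in scores]
    mixed = matmul(weights, v)
    # Split back into streams and apply each stream's own output projection.
    text_out = matmul(mixed[:n_text], text_proj.wo)
    video_out = matmul(mixed[n_text:], video_proj.wo)
    return text_out, video_out, weights


rng = random.Random(0)
text_proj, video_proj = StreamProjections(4, rng), StreamProjections(4, rng)
text = random_matrix(3, 4, rng)   # stand-in for language/action features
video = random_matrix(5, 4, rng)  # stand-in for video latent tokens
text_out, video_out, attn = joint_attention(text, video, text_proj, video_proj)
print(len(text_out), len(video_out), len(attn))
```

Because every video token attends to every text token inside the shared softmax, changing the instruction features changes the video-stream output, which is the coupling the abstract describes. A model following the abstract would repeat a step like this in each of its 60 layers.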