Qwen-RobotManip 技术报告:对齐解锁机器人操作基础模型的规模化能力
阅读原文· arxiv.orgQwen-RobotManip 是基于 Qwen-VL 构建的视觉-语言-操作基础模型,通过跨表示、运动和行为维度的统一对齐框架,实现大规模多源训练的一致性。仅利用开源数据集和人类视频(无需专有数据),构建约 38,100 小时预训练语料,展现出零样本指令跟随、扰动鲁棒、错误恢复及跨本体迁移等涌现能力。在 RoboCasa365、LIBERO-Plus、EBench、RoboTwin 系列等 OOD 评测上全面超越先前 SOTA(包括 π0.5),在 RoboChallenge 排名第一且相对提升 20%,并在 AgileX ALOHA、Franka、UR、ARX 等真实机器人平台上得到验证。
Foundation models in language and multimodality achieve strong generalization by aligning heterogeneous data under a unified formulation and training at scale. In this report, we investigate whether this scaling recipe can be applied to robotic manipulation to achieve genuine generalization. This is challenging because, unlike text, manipulation data is heterogeneous by nature, expensive to collect, and narrow in diversity, making alignment and scale simultaneously difficult. We present Qwen-RobotManip, a generalizable Vision-Language-Action foundation model built on Qwen-VL. Qwen-RobotManip introduces a unified alignment framework across the representation, motion, and behavioral dimensions of manipulation, making large-scale multi-source training coherent rather than conflicting. This alignment capability in turn enables Qwen-RobotManip to absorb manipulation data at a scale that prior training regimes could not sustain. A human-to-robot synthesis pipeline converts egocentric hand demonstrations into robot trajectories across 15 platforms, and a rigorous curation pipeline harmonizes heterogeneous datasets. Using only open-source datasets and human videos without proprietary data collection, Qwen-RobotManip constructs a ~38,100-hour pretraining corpus and exhibits emergent generalization capabilities, including zero-shot instruction following, robustness to perturbations, reactive error recovery, and cross-embodiment transfer. We find that standard benchmarks fail to capture pretraining quality and instead adopt OOD settings including RoboCasa365, LIBERO-Plus, EBench, RoboTwin-Clean2Rand, RoboTwin-IF, and RoboTwin-XE. Qwen-RobotManip substantially outperforms prior state-of-the-art models, including π0.5, across all OOD settings, ranks 1st in RoboChallenge with a 20% relative improvement, and is validated on real-robot platforms including AgileX ALOHA, Franka, UR, and ARX.