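Qwen-Robot Suite: A Foundation Model Suite for Physical World Intelligence

2026-06-16 10:00 · 8 days ago · QwenTeam

AI Summary

Qwen has released three foundation models: Qwen-RobotNav, Qwen-RobotManip, and Qwen-RobotWorld. Nav uses a controllable observation protocol to unify five task families (instruction following, point-goal and object-goal navigation, target tracking, and autonomous driving), reaching 76.5% SR on VLN-CE RxR, 75.6% SR on HM3Dv2 object-goal navigation (RGB only), a 90.0% tracking rate on EVT-Bench, and 91.4 PDMS on NAVSIM. Manip uses a canonical state-action space for cross-embodiment training on more than 38,100 hours of heterogeneous open-source robot data. World co-trains more than 20 embodiments through a natural-language action interface and predicts physical futures for manipulation, driving, and navigation. Together, the three turn general intelligence into physical action.

Original text · untranslated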

Qwen-Robot Suite: A Foundation Model Suite for Physical World Intelligence

2026/06/16 · 13 minutes · 2590 words · QwenTeam | Translations: Simplified Chinese

TL;DR The Qwen family of foundation models already gives strong perception and reasoning about the physical world. But seeing is not acting: the gap between vision and language understanding and physical control remains the central bottleneck for embodied intelligence. The Qwen-Robot Suite bridges this gap with three foundation models — Qwen-RobotNav, Qwen-RobotManip, and Qwen-RobotWorld. Nav unifies five navigation task families through a controllable observation protocol. Manip turns heterogeneous robot data into a coherent canonical space, enabling cross-embodiment training at scale. World co-trains 20+ embodiments via a natural-language action interface under one world model. Together, they enable an agentic system where general intelligence translates directly into physical action.

The Qwen family of multimodal foundation models has made remarkable progress in understanding the physical world. Qwen-VL can parse complex spatial relationships, identify objects in cluttered scenes, follow multi-step visual instructions, and reason about physical configurations, giving physical agents a preliminary cognitive foundation. A VLM can already plan in language: “go to the kitchen, find the red cup, pick it up, and place it on the shelf.”

But understanding the physical world is not the same as acting in it. A VLM that can plan those steps cannot produce the motor commands that execute them. This is fundamentally an alignment challenge: language instructions and physical action signals live in different representation spaces, and bridging them requires more than perception alone. What makes this harder is that the embodied data needed to close the gap is fundamentally unlike internet text: it is heterogeneous by nature, expensive to collect, and narrow in diversity. A navigation trajectory, a tele-operated grasp, and a dashcam clip live in incompatible action spaces, observation formats, and embodiments. Naively pooling them produces conflict rather than synergy.

The Qwen-Robot Suite bridges this gap with three foundation models — Qwen-RobotNav, Qwen-RobotManip, and Qwen-RobotWorld — each aligning language with a different domain of physical action. We pursue generalization across language instructions and adherence to physical laws, and have achieved significant progress on both fronts. In this post, we also explore the potential of these models as low-level tools for building general-purpose agentic systems.

Qwen-RobotNav (The Gateway to Mobility for Physical Agents): bridges the vision-language representation space into mobility actions through controllable observation encoding and a tool interface for agentic systems, unifying instruction following, point/object-goal navigation, target tracking, and autonomous driving.

Qwen-RobotManip (The Foundation of Interaction for Physical Agents): bridges the vision-language representation space into manipulation actions through a canonical state-action space and camera-frame delta poses, enabling coherent cross-embodiment training on a >38,100-hour open-source corpus.

Qwen-RobotWorld (Infinite Worlds for Physical Agents): bridges the vision-language representation space into world dynamics through a natural-language action interface, enabling a single world model to predict physically grounded futures across manipulation, driving, and navigation.

Each has its own technical report and deep-dive blog. This post tells the story of how they fit together.

Qwen-RobotNav: The Gateway to Physical Mobility

NAVIGATE

[Videos 1-3]

Before an agent can manipulate anything, it has to get there. Mobile navigation spans tasks with fundamentally different memory requirements: instruction following demands long-horizon context while target tracking cares almost entirely about recent frames. No fixed observation strategy serves both.

Qwen-RobotNav, built on Qwen3-VL, addresses this through a parameterised navigation interface with two complementary dimensions: task modes that select the navigation behaviour (instruction following, object search, target tracking, autonomous driving), and controllable observation parameters (token budget, temporal decay, per-camera weights, frame sample mode) that govern how visual history is encoded. Trained on 15.6 million samples with co-training on vision-language data to preserve grounded perception, Qwen-RobotNav unifies five task families under a single set of weights. The parameterised interface also makes Qwen-RobotNav a natural building block for agentic systems. An upper-level planner (Qwen3.7-Plus) decomposes long-horizon goals into sub-tasks and dynamically switches Qwen-RobotNav’s task mode and context strategy mid-episode, composing complex behaviours from repeated calls to the same model. This extends the system to long-horizon reasoning with persistent memory, enabling it to solve complex user intents that require multi-step navigation, evidence gathering, and grounded response generation.
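To make the two dimensions of the interface concrete, here is a minimal sketch of how a task mode and the four observation parameters might be represented, and how a token budget could be split across the visual history under temporal decay. The class names, field names, default values, and the proportional allocation rule are all assumptions for illustration; the post does not specify the actual encoding scheme.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class TaskMode(Enum):
    # The four behaviours named in the text; the string values are hypothetical.
    INSTRUCTION_FOLLOWING = "instruction_following"
    OBJECT_SEARCH = "object_search"
    TARGET_TRACKING = "target_tracking"
    AUTONOMOUS_DRIVING = "autonomous_driving"


@dataclass
class ObservationConfig:
    # Hypothetical container for the four controllable observation parameters.
    token_budget: int = 1024            # total visual tokens for the history
    temporal_decay: float = 1.0         # weight multiplier per step into the past
    camera_weights: Dict[str, float] = field(default_factory=lambda: {"front": 1.0})
    frame_sample_mode: str = "uniform"  # assumed values: "uniform" or "recent"


def allocate_tokens(cfg: ObservationConfig, num_frames: int) -> List[int]:
    """Split the token budget over frames (oldest first).

    Assumed rule: frame weight = temporal_decay ** age, where the newest
    frame has age 0. Only the temporal axis is modelled here; camera
    weights and the sample mode are left out to keep the sketch short.
    """
    if num_frames <= 0:
        return []
    weights = [cfg.temporal_decay ** (num_frames - 1 - i) for i in range(num_frames)]
    total = sum(weights)
    return [int(cfg.token_budget * w / total) for w in weights]


# Tracking cares mostly about recent frames, so a strong decay is plausible.
tracking = ObservationConfig(token_budget=700, temporal_decay=0.5)
print(TaskMode.TARGET_TRACKING.value, allocate_tokens(tracking, 3))

# Instruction following needs long-horizon context, so no decay.
following = ObservationConfig(token_budget=1000, temporal_decay=1.0)
print(TaskMode.INSTRUCTION_FOLLOWING.value, allocate_tokens(following, 4))
```

Under these assumptions, a planner switching task mode mid-episode would only need to swap the `TaskMode` and `ObservationConfig` values passed alongside the next call, with no change to the model itself.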

Highlights:

5 Domains, 8 SOTAs: One Model Unifies VLN, ObjNav, Tracking, Driving & EQA

Context = Interface: 4-Axis Observation Protocol, Zero Architecture Change

Agentic: Navigation as a Tool Call + Two-Level Memory, Raises the new bar on EQA

Generalization: In-the-Wild Single-Camera Deployment in Unseen Environments

Unified Multi-Domain Navigation: a single model with one set of weights achieves state-of-the-art across 5 navigation domains: 76.5% SR on VLN-CE RxR, 75.6% SR on HM3Dv2 object-goal (RGB only, surpassing depth-based methods), 90.0% tracking rate on EVT-Bench, 91.4 PDMS on NAVSIM, and new bests on 3 EQA benchmarks, with consistent scaling from 2B to 8B parameters.

Controllable Observation Protocol: four axes (visual token budget, temporal decay, per-camera weighting, frame sample mode) are exposed as inference-time parameters and randomized per sample at training time, enabling any inference-time configuration without retraining or architectural modification.

Agentic Navigation System: designed as a reconfigurable navigation primitive within a two-tier system, where an upper-level planner (Qwen3.7-Plus) decomposes long-horizon goals and dispatches configurable navigation calls while maintaining two-level memory, achieving +15.4% on EXPRESS-Bench with 77% fewer navigation steps than the prior best.

In-the-Wild Generalization: deployed zero-shot on a Unitree Go2 quadruped with its single low-resolution built-in camera, demonstrating strong generalization to in-the-wild environments and unconstrained natural-language instructions without any environment-specific fine-tuning.
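The per-sample randomization described for the observation protocol can be pictured with a short sketch. Everything below (the function name, the candidate budgets, the decay range, the sample-mode names) is hypothetical; the post only states that the four axes are randomized per training sample so that any configuration can be chosen at inference time.

```python
import random
from typing import Dict, List


def sample_observation_config(rng: random.Random, cameras: List[str]) -> Dict[str, object]:
    """Draw one random setting of the four axes for a single training sample.

    All ranges and choices are illustrative assumptions, not published values.
    """
    return {
        "token_budget": rng.choice([256, 512, 1024, 2048]),
        "temporal_decay": rng.uniform(0.3, 1.0),
        "camera_weights": {name: rng.uniform(0.0, 1.0) for name in cameras},
        "frame_sample_mode": rng.choice(["uniform", "recent"]),
    }


rng = random.Random(0)  # fixed seed so the draw is reproducible
for _ in range(3):
    print(sample_observation_config(rng, ["front", "left", "right"]))
```

Because the model sees many such draws during training, a caller can later fix the configuration that suits the task without retraining, which is the property the protocol is described as providing.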

Benchmarking Qwen-RobotNav

Deployment

Deployed zero-shot on a Unitree Go2 quadruped (NVIDIA Jetson Thor, 196ms latency) using only the built-in low-resolution camera. The robot executes step-by-step verbal instructions across multiple rooms in a previously unseen apartment.

[Videos 4-7]

Instruction Following

We evaluate a back-and-forth navigation task in an unseen exhibition hall: the robot first navigates 21.78 m from a living room to a hospital room following language instructions, then receives a reverse command and must precisely retrace the entire route. This is particularly challenging as it requires the model to maintain spatial awareness over long distances, ground diverse visual landmarks in both forward and reverse directions, and execute accurate bidirectional position control purely from language.

[Videos 8-9]

Cross Embodiment

A single set of weights serves both legged robot navigation and autonomous driving. On NAVSIM closed-loop driving, Qwen-RobotNav-4B achieves 91.4 PDMS.

[Videos 10-11]

Read the Qwen-RobotNav blog →

Qwen-RobotManip: The Foundation of Physical Interaction

MANIPULATE

[Videos 12-14]

Physical agents need to interact with the real world — for example, completing manipulation tasks with robot arms. Yet an industrial arm on a production line and a service arm in a kitchen may perform visually similar grasping motions while having entirely different joint configurations and action spaces. The core challenge is making heterogeneous embodiments representationally compatible, so that scaling across robots and data sources produces synergy rather than conflict.

Qwen-RobotManip, built on Qwen3.5-4B VL with a flow-matching DiT action head, introduces three mechanisms to solve this. A unified 80-dimensional state-action representation is shared across single-arm, dual-arm, dexterous-hand, and mobile embodiments. Camera-frame end-effector delta pose actions make visually similar motions numerically proximate across robots, abstracting away morphological differences. In-context policy adaptation reads execution history as an implicit embodiment signature for on-the-fly adaptation.
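One way to picture a shared 80-dimensional state-action vector is as a fixed layout of named slots, where each embodiment fills only the slots it has and a mask marks which entries are valid. The slot names, offsets, and sizes below are invented for illustration; only the total dimension of 80 comes from the post.

```python
from typing import Dict, List, Tuple

CANONICAL_DIM = 80  # from the post; the slot layout below is hypothetical

# Hypothetical layout: slot name -> (start index, length).
SLOTS: Dict[str, Tuple[int, int]] = {
    "left_eef_delta": (0, 6),      # xyz translation + 3-dim rotation
    "left_gripper": (6, 1),
    "right_eef_delta": (7, 6),
    "right_gripper": (13, 1),
    "left_hand_joints": (14, 24),
    "right_hand_joints": (38, 24),
    "base_velocity": (62, 3),
    # indices 65..79 are left unused in this sketch
}


def pack_action(parts: Dict[str, List[float]]) -> Tuple[List[float], List[int]]:
    """Place embodiment-specific action parts into the canonical vector.

    Returns (vector, mask); mask[i] == 1 where the embodiment supplies a value.
    """
    vec = [0.0] * CANONICAL_DIM
    mask = [0] * CANONICAL_DIM
    for name, values in parts.items():
        start, length = SLOTS[name]
        if len(values) != length:
            raise ValueError(f"{name} expects {length} values, got {len(values)}")
        vec[start:start + length] = values
        mask[start:start + length] = [1] * length
    return vec, mask


# A single-arm robot fills only the left-arm slots.
single_arm_vec, single_arm_mask = pack_action({
    "left_eef_delta": [0.01, 0.0, 0.0, 0.0, 0.0, 0.1],
    "left_gripper": [1.0],
})
print(len(single_arm_vec), sum(single_arm_mask))
```

With a layout like this, a loss could be restricted to masked-in entries, so a single-arm sample does not penalize predictions in dual-arm or dexterous-hand slots. Whether Qwen-RobotManip uses masking in this way is not stated in the post.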

Once the representation framework is unified, the data barrier drops. We train the VLA model on 11,320 hours of open-source robot data, 1,933 hours of open-source egocentric human video, and 24,808 hours of robot demonstrations across 15 embodiments synthesized from the human video via our Human-to-Robot synthesis pipeline — totaling >38,100 hours. Using only open-source data, the model already exhibits emergent generalization capabilities, including robustness to perturbations, zero-shot instruction following, reactive error recovery, and cross-embodiment transfer.

Highlights:

Alignment: Representation · Motion · Behavior (Three-Dimensional Alignment)

Open-Source Data Only: 38K Hours of Manipulation Data Across 15 Embodiments

Dominant OOD Generalization Across All Benchmarks

RoboChallenge Table30 v1 Generalist Track: Sweeping Top 2, 20% Ahead of 3rd Place

Unified Cross-Embodiment Alignment Framework: a unified 80-dimensional state-action representation accommodates diverse embodiments, camera-frame end-effector delta poses make visually similar motions numerically proximate, and in-context policy adaptation reads execution history as an implicit embodiment identifier. Together, these enable consistent signal extraction across embodiments.

Human-to-Robot Synthesis at Scale: a pipeline converting 1,933h of egocentric human video into 24,808h of robot demonstrations across 15 embodiments via action retargeting, hand removal and inpainting, simulated rendering, and depth-guided compositing, coupled with a multi-stage curation pipeline ensuring data quality.

OOD Generalization: LIBERO-Plus 91.4% (+7.0 over π0.5), RoboTwin-C2R Hard 69.4% (+21.5 over π0.5), RoboCasa365 Composite-Unseen 14.9% (3× next best), EBench 45.6% (+18.5 over next best); RoboTwin-IF 72.0% (+22.4 over π0.5) confirming genuine language-conditioned control; 3× next best on RoboTwin-XE showing zero-shot cross-embodiment transfer.

Strong Real-World Performance: #1 on RoboChallenge Table30 v1 generalist track with 45% SR, sweeping top 2 and leading 3rd place by 20%; validated on real-robot platforms with 2× prior SOTA on in-domain and OOD tasks, few-shot adaptation, and cross-embodiment skill transfer.
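The camera-frame delta pose idea in the alignment framework above can be illustrated with a small geometric sketch. It assumes one plausible definition: the end-effector pose is re-expressed in the camera frame, the translation delta is the difference of positions, and the rotation delta is the next rotation composed with the inverse of the current one. The exact convention used by Qwen-RobotManip is not given in the post, and the function names here are hypothetical.

```python
from typing import List, Tuple

Mat3 = List[List[float]]
Vec3 = List[float]


def matmul(a: Mat3, b: Mat3) -> Mat3:
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def transpose(a: Mat3) -> Mat3:
    return [[a[j][i] for j in range(3)] for i in range(3)]


def matvec(a: Mat3, v: Vec3) -> Vec3:
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]


def camera_frame_delta(
    R_wc: Mat3,                     # camera-to-world rotation (camera axes in world)
    R_we_t: Mat3, p_we_t: Vec3,     # end-effector pose in world at step t
    R_we_next: Mat3, p_we_next: Vec3,  # end-effector pose in world at step t+1
) -> Tuple[Vec3, Mat3]:
    """Return (translation delta, rotation delta), both expressed in the camera frame."""
    R_cw = transpose(R_wc)  # inverse of a rotation matrix is its transpose
    # A position difference is a free vector, so the camera position cancels out.
    dp_world = [b - a for a, b in zip(p_we_t, p_we_next)]
    dp_cam = matvec(R_cw, dp_world)
    # Re-express both orientations in the camera frame, then take the relative rotation.
    R_ce_t = matmul(R_cw, R_we_t)
    R_ce_next = matmul(R_cw, R_we_next)
    dR_cam = matmul(R_ce_next, transpose(R_ce_t))
    return dp_cam, dR_cam


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Camera yawed 90 degrees about the world z-axis.
R_wc = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
# The end effector moves 0.1 m along world +y without rotating.
dp, dR = camera_frame_delta(R_wc, identity, [0.0, 0.0, 0.0], identity, [0.0, 0.1, 0.0])
print(dp, dR)
```

In this example the world-frame motion along +y appears as a motion along the camera's +x axis. Two robots with different base frames that produce the same on-screen motion would therefore receive the same numeric action, which is the property the text attributes to camera-frame deltas.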

Key Finding: Alignment is the prerequisite for scale. Only models with unified cross-embodiment representations (UnifiedSpace + UnifiedEEF) exhibit clean log-linear data scaling. Without alignment, adding more data produces erratic or flat curves; scale cannot compensate for a broken formulation.

Diverse Real-World Tasks

A single generalist policy handles complex manipulation across diverse task categories, scenes, and objects.

[Videos 15-22]

Instruction Following

Following diverse unseen instructions across real-world settings (top row) and simulation (bottom row).

[Videos 23-30]

Cross-Embodiment Transfer

Tasks trained on other embodiments transfer zero-shot to new ones (top row); few-shot demonstrations enable rapid adaptation to entirely new tasks (bottom row).

[Videos 31-38]

Read the Qwen-RobotManip blog →

Qwen-RobotWorld: Infinite Robotics Worlds

IMAGINE

Real-world experience is the scarcest resource in robotics. Qwen-RobotWorld addresses this by learning the world’s state transition function directly: given the current observation and a natural-language action, it predicts what the world will look like next. The key design choice is expressing all actions in natural language. This converts end-effector poses, steering commands, and navigation waypoints into a single interface, enabling 20+ embodiment types and 500+ action categories to be co-trained under the Embodied World Knowledge corpus (8.6M video-text pairs, 200M+ frames).

A 60-layer dual-stream MMDiT couples Qwen2.5-VL’s semantic representations with video latents. Using a full multimodal LLM as the action encoder, rather than a lightweight text encoder, is load-bearing: it brings internalized world knowledge that arms are rigid bodies, fluids spread, and objects fall, implicitly constraining generation toward physically plausible futures. Each domain reinforces the others: manipulation teaches contact physics, driving teaches 3D geometry, navigation teaches room-scale spatial reasoning.
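As a rough illustration of the natural-language action interface, the sketch below renders three structurally different action types (an end-effector delta, a steering command, a navigation waypoint) as plain text, so that one text-conditioned predictor could consume all of them. The templates, units, sign conventions, and the `WorldModel` protocol are assumptions made for this example and are not taken from the Qwen-RobotWorld report.

```python
from typing import Any, Protocol


def describe_eef_delta(dx_cm: float, dy_cm: float, dz_cm: float, gripper_closed: bool) -> str:
    # Hypothetical template for a manipulation action.
    grip = "close" if gripper_closed else "open"
    return f"move the end effector by ({dx_cm:.1f}, {dy_cm:.1f}, {dz_cm:.1f}) cm and {grip} the gripper"


def describe_steering(angle_deg: float, speed_mps: float) -> str:
    # Assumed sign convention: negative angles steer left.
    direction = "left" if angle_deg < 0 else "right"
    return f"steer {direction} by {abs(angle_deg):.1f} degrees at {speed_mps:.1f} m/s"


def describe_waypoint(forward_m: float, lateral_m: float) -> str:
    # Assumed sign convention: positive lateral offsets are to the left.
    side = "left" if lateral_m > 0 else "right"
    return f"go {forward_m:.1f} m forward and {abs(lateral_m):.1f} m to the {side}"


class WorldModel(Protocol):
    # Hypothetical interface: the state transition function described in the text.
    def predict_next(self, observation: Any, action_text: str) -> Any: ...


actions = [
    describe_eef_delta(1.0, 0.0, -2.5, True),
    describe_steering(-5.0, 8.0),
    describe_waypoint(2.0, -0.5),
]
for text in actions:
    print(text)
```

Once every action is a string, adding a new embodiment means writing a new description function rather than widening a numeric action vector, which is one reading of why the text calls this the key design choice.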

Highlights:

Top-Tier Across 4 Benchmarks

20+ Robot Embodiments Unified

8.6M Cross-Scenario Training Pairs

1300+ Manipulation Skills

Language-Driven Unified Action Interface: natural language standardizes 20+ robot embodiments and 500+ action categories into one training interface, enabling manipulation, driving, navigation, and human-to-robot transfer to be jointly trained; each domain reinforces the others.

Dual-Stream MMDiT + Qwen2.5-VL Action Encoder: a full multimodal LLM as action encoder (not a lightweight text encoder) parses complex compositional instructions into precise generation signals with internalized physical world knowledge, serving as a synthetic data engine, closed-loop policy evaluator, and action planner.

Rankings: 1st overall on EWMBench (motion fidelity +33% over runner-up) and DreamGen Bench; 1st open-source on WorldModelBench (perfect physics adherence on Newton’s laws, mass conservation, fluid dynamics) and PBBench.

Capabilities: fine-grained language grounding (changing one keyword → different future); human-to-robot transfer across 8+ embodiments with multi-view consistent generation; zero-shot robustness on RoboTwin-IF.

Any Instruction Following

Changing a single keyword (object, destination, or action verb) produces a correspondingly different future, which indicates that the world model grounds the instruction itself rather than merely pattern-matching.

Different Object: Video 39 (Pick up red strawberry) ⇄ Video 40 (Pick up yellow potato)

Different Embodiment: Video 41 (Assemble camera parts) ⇄ Video 42 (Assemble camera parts)

Different Destination: Video 43 (Place pen on wooden tray) ⇄ Video 44 (Place pen on white paper)

Different Action: Video 45 (Extend glue forward) ⇄ Video 46 (Place glue into penholder)

Multi-View Consistent Generation

Given a single instruction, Qwen-RobotWorld generates temporally and spatially consistent videos across multiple camera viewpoints — critical for sim-to-real transfer and multi-camera policy training.

[Videos 47-66]

Human → Robot Transfer

Given a human demonstration, Qwen-RobotWorld generates realistic robot execution across diverse embodiments — no teleoperation required.

ARX-L5: Video 67 (Human Demo) → Video 68 (Robot Execution)

xArm7: Video 69 (Human Demo) → Video 70 (Robot Execution)

Franka Panda: Video 71 (Human Demo) → Video 72 (Robot Execution)

Sawyer: Video 73 (Human Demo) → Video 74 (Robot Execution)

Kinova Gen3: Video 75 (Human Demo) → Video 76 (Robot Execution)

Piper: Video 77 (Human Demo) → Video 78 (Robot Execution)

KUKA iiwa: Video 79 (Human Demo) → Video 80 (Robot Execution)

Kinova Jaco: Video 81 (Human Demo) → Video 82 (Robot Execution)

Autonomous Driving & Indoor Navigation

Driving teaches large-scale 3D geometry and multi-agent dynamics; navigation teaches room-scale spatial reasoning. Each domain reinforces the others.

Driving: [Videos 83-86]

Indoor Navigation: [Videos 87-90]

Read the Qwen-RobotWorld blog →

From Models to Agents: Closing the Loop

Each model is independently useful, but because all three expose language-first interfaces, general-purpose Qwen models can compose with them as physical-world tools, connecting general intelligence to physical action. Our in-house project Qwen-RobotClaw is a robotics agent harness that lets Qwen VLM agents call Qwen-Robot Suite models as physical-world tools while managing the context and memory that long-horizon tasks require, pushing physical intelligence toward more general and more complex real-world applications. Here are early examples of what this makes possible.

Open-Ended Task Execution

Qwen-Omni observes the scene, randomly proposes manipulation tasks via speech, and judges execution in real time. Each video shows Qwen-RobotManip completing tasks on the fly with no pre-defined task list — demonstrating that a general-purpose multimodal model can serve as the task proposer and evaluator, while the suite model handles physical execution.

[Videos 91-92]

Long-Horizon Manipulation

We develop a VLM-driven agentic VLA system, in which the base Qwen-3.5 model serves as the high-level planner and Qwen-RobotManip handles low-level execution. Leveraging its capabilities in scene understanding, spatial reasoning, and task progress assessment, Qwen-3.5 decomposes a complex high-level instruction into a sequence of atomic subtasks, which are then executed by the VLA. This division of labor significantly improves robustness to OOD scenes and instructions.

We illustrate this with a table-cleaning task on a cluttered tabletop with a basket. Given such a fully OOD scene and abstract instruction, directly using the VLA model exhibits clearly abnormal behavior (right). With Qwen-3.5 as the planner, the system decomposes the task into fine-grained atomic subtasks in real time (left), allowing the VLA to focus on one simple step at a time and demonstrating compositional generalization.

We also observe that subtask decomposition helps the system recover from failure-and-retry loops. When the high-level VLM detects that execution has stalled, it replans by issuing a new subtask, enabling the system to resume progress and ultimately complete the task successfully.
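The decompose, execute, and replan-on-stall loop described above can be sketched in a few lines. The `plan` and `execute` callables stand in for the high-level VLM planner and the VLA respectively; their signatures, the history format, and the stub behaviour below are all hypothetical and exist only so the control flow can be run end to end.

```python
from typing import Callable, Dict, List

Planner = Callable[[str, List[str]], List[str]]  # (instruction, history) -> subtasks
Executor = Callable[[str], bool]                 # subtask -> True if it succeeded


def run_agentic_vla(instruction: str, plan: Planner, execute: Executor,
                    max_replans: int = 3) -> List[str]:
    """Run subtasks one at a time; when one stalls, ask the planner to replan."""
    history: List[str] = []
    replans = 0
    subtasks = plan(instruction, history)
    while subtasks:
        subtask = subtasks.pop(0)
        if execute(subtask):
            history.append(f"done: {subtask}")
            continue
        history.append(f"stalled: {subtask}")
        replans += 1
        if replans > max_replans:
            break  # give up instead of looping forever
        subtasks = plan(instruction, history)
    return history


# Stub planner: returns the steps that are not yet marked done.
def stub_plan(instruction: str, history: List[str]) -> List[str]:
    steps = ["pick up the cup", "place the cup in the basket"]
    done = {h[len("done: "):] for h in history if h.startswith("done: ")}
    return [s for s in steps if s not in done]


# Stub executor: the placing step fails on its first attempt only.
attempts: Dict[str, int] = {}

def stub_execute(subtask: str) -> bool:
    attempts[subtask] = attempts.get(subtask, 0) + 1
    return not (subtask == "place the cup in the basket" and attempts[subtask] == 1)


history = run_agentic_vla("clean the table", stub_plan, stub_execute)
print(history)
```

The cap on replans is a design choice of this sketch: it bounds the failure-and-retry loops mentioned in the text so that a permanently blocked subtask ends the episode rather than repeating indefinitely.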

Video 93: Agent + VLA (success)

Video 94: VLA alone (failure)

Agentic Navigation & Embodied QA

By combining the agent system with Qwen-RobotNav, we achieve a substantial improvement over previous state of the art on long-horizon 3D physical-world exploration tasks, including Embodied Question Answering benchmarks such as HM-EQA, MT-HM3D, and EXPRESS-Bench. We can also deploy the same open-world exploration capability in real environments, as shown in the demos below. We will release more technical details in future updates.

In the first demo, the user asks the agent to find an open restroom in a real building. The agent scans the environment, follows corridor-level cues to search for restroom signage, discovers that the first restroom is unavailable due to a visible "Cleaning in Progress / 暂停使用" sign, and then replans to search for an alternative on the other side of the building. After verifying from visual evidence that the second restroom is open and accessible, it returns an evidence-grounded answer.

Video 95

In the second demo, which was previously released together with Qwen3.7-Max, the agent calls Qwen-RobotNav to autonomously explore an open campus environment, correct its deviations along the way, and finally recover a lost umbrella.

Video 96
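For readers who want a picture of "navigation as a tool call", the following sketch shows how a planner might package one navigation request and keep a two-tier record of what it has seen. The post does not describe the tool schema or what the two memory levels contain, so the tool name, the argument fields, and the short-term/long-term split used here are purely assumptions.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class NavCall:
    # Hypothetical arguments for a single call to the navigation model.
    task_mode: str
    instruction: str
    observation: Dict[str, object]


def to_tool_call(call: NavCall) -> Dict[str, object]:
    """Wrap the request in a generic name/arguments structure (tool name is made up)."""
    return {"name": "qwen_robotnav.navigate", "arguments": asdict(call)}


@dataclass
class TwoLevelMemory:
    # Assumed split: recent evidence kept verbatim, older evidence moved to a long-term list.
    recent: List[str] = field(default_factory=list)
    long_term: List[str] = field(default_factory=list)

    def record(self, evidence: str, keep_recent: int = 3) -> None:
        self.recent.append(evidence)
        while len(self.recent) > keep_recent:
            self.long_term.append(self.recent.pop(0))


memory = TwoLevelMemory()
for note in ["corridor sign points left", "first restroom closed for cleaning",
             "stairwell on the right", "second restroom door open"]:
    memory.record(note)

call = NavCall(
    task_mode="object_search",
    instruction="find an open restroom",
    observation={"token_budget": 1024, "temporal_decay": 0.8},
)
print(to_tool_call(call))
print(memory.recent, memory.long_term)
```

In a fuller system the long-term list would more likely hold summaries than raw notes, but even this simple version shows how evidence from earlier calls can survive across many navigation steps.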

Chat2Robot

We provide an experimental feature — Chat2Robot — where you can chat with a robot directly in your browser. Simply type a natural-language instruction, and watch the robot respond in real time. Give it a try and experience Qwen-Robot Suite in action!

Note: Chat2Robot currently supports Qwen-RobotManip only. The deployed policy is trained solely on the RoboTwin-Clean dataset, which contains only 50 tasks — it is not a perfect policy. Our goal here is to demonstrate a degree of zero-shot instruction following capability. This feature is still under active development and may not be fully polished — we welcome your feedback and suggestions!

We thank D-Robotics (Digua) for supporting this feature.

What’s Next

Physical world intelligence is still in its infancy. Contact-rich long-horizon tasks, lifelong learning, tighter integration between general-purpose planners and physical-world executors, and richer human-robot-environment interaction all remain open. But the path is becoming clear: start from strong multimodal understanding, bridge the vision-language representation space into each type of physical action, scale the training, and demand generalization.

A physical agent that can go anywhere, do anything, and foresee what comes next.

That is the destination — and the Qwen-Robot Suite is our first full step towards it.

Citation

@article{qwenrobotnav,
  title={Qwen-RobotNav: A Scalable Navigation Model Designed for an Agentic Navigation System},
  author={Qwen Team},
  year={2026}
}

@article{qwenrobotmanip,
  title={Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models},
  author={Qwen Team},
  year={2026}
}

@article{qwenrobotworld,
  title={Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation},
  author={Qwen Team},
  year={2026}
}
