Qwen-Robot Suite: A Foundation Model Suite for Physical World Intelligence
2026/06/16 · 13 min read · 2590 words · Qwen Team | Translations: 简体中文
TL;DR: The Qwen family of foundation models already provides strong perception and reasoning about the physical world. But seeing is not acting: the gap between vision-language understanding and physical control remains the central bottleneck for embodied intelligence. The Qwen-Robot Suite bridges this gap with three foundation models: Qwen-RobotNav, Qwen-RobotManip, and Qwen-RobotWorld. Nav unifies five navigation task families through a controllable observation protocol. Manip maps heterogeneous robot data into a coherent canonical space, enabling cross-embodiment training at scale. World co-trains 20+ embodiments under one world model via a natural-language action interface. Together, they enable an agentic system in which general intelligence translates directly into physical action.
The Qwen family of multimodal foundation models has made remarkable progress in understanding the physical world. Qwen-VL can parse complex spatial relationships, identify objects in cluttered scenes, follow multi-step visual instructions, and reason about physical configurations, giving physical agents a preliminary cognitive foundation. A VLM can already plan in language: “go to the kitchen, find the red cup, pick it up, and place it on the shelf.”
But understanding the physical world is not the same as acting in it. A VLM that can plan those steps cannot produce the motor commands that execute them. This is fundamentally an alignment challenge: language instructions and physical action signals live in different representation spaces, and bridging them requires more than perception alone. What makes this harder is that the embodied data needed to close the gap is fundamentally unlike internet text: it is heterogeneous by nature, expensive to collect, and narrow in diversity. A navigation trajectory, a tele-operated grasp, and a dashcam clip live in incompatible action spaces, observation formats, and embodiments. Naively pooling them produces conflict rather than synergy.
The Qwen-Robot Suite bridges this gap with three foundation models — Qwen-RobotNav, Qwen-RobotManip, and Qwen-RobotWorld — each aligning language with a different domain of physical action. We pursue generalization across language instructions and adherence to physical laws, and have achieved significant progress on both fronts. In this post, we also explore the potential of these models as low-level tools for building general-purpose agentic systems.
Qwen-RobotNav (The Gateway to Mobility for Physical Agents) bridges the vision-language representation space into mobility actions through controllable observation encoding and a tool interface for agentic systems, unifying instruction following, point/object-goal navigation, target tracking, and autonomous driving.
Qwen-RobotManip (The Foundation of Interaction for Physical Agents) bridges the vision-language representation space into manipulation actions through a canonical state-action space and camera-frame delta poses, enabling coherent cross-embodiment training on a roughly 38,100-hour open-source corpus.
Qwen-RobotWorld (Infinite Worlds for Physical Agents) bridges the vision-language representation space into world dynamics through a natural-language action interface, enabling a single world model to predict physically grounded futures across manipulation, driving, and navigation.
Each has its own technical report and deep-dive blog. This post tells the story of how they fit together.
Qwen-RobotNav: The Gateway to Physical Mobility
NAVIGATE
Before an agent can manipulate anything, it has to get there. Mobile navigation spans tasks with fundamentally different memory requirements: instruction following demands long-horizon context while target tracking cares almost entirely about recent frames. No fixed observation strategy serves both.
Qwen-RobotNav, built on Qwen3-VL, addresses this through a parameterised navigation interface with two complementary dimensions: task modes that select the navigation behaviour (instruction following, object search, target tracking, autonomous driving), and controllable observation parameters (token budget, temporal decay, per-camera weights, frame sample mode) that govern how visual history is encoded. Trained on 15.6 million samples with co-training on vision-language data to preserve grounded perception, Qwen-RobotNav unifies five task families under a single set of weights. The parameterised interface also makes Qwen-RobotNav a natural building block for agentic systems. An upper-level planner (Qwen3.7-Plus) decomposes long-horizon goals into sub-tasks and dynamically switches Qwen-RobotNav’s task mode and context strategy mid-episode, composing complex behaviours from repeated calls to the same model. This extends the system to long-horizon reasoning with persistent memory, enabling it to solve complex user intents that require multi-step navigation, evidence gathering, and grounded response generation.
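The two dimensions of the interface can be pictured as a small configuration object that a planner fills in for each call. The sketch below is only an illustration under stated assumptions: the class names, field names, default values, and payload shape are hypothetical and do not describe the actual Qwen-RobotNav API. Only the four task modes and the four observation parameters come from the text.

```python
from dataclasses import dataclass, field, asdict
from enum import Enum


class TaskMode(str, Enum):
    # The four task modes named in the text; the identifiers are illustrative.
    INSTRUCTION_FOLLOWING = "instruction_following"
    OBJECT_SEARCH = "object_search"
    TARGET_TRACKING = "target_tracking"
    AUTONOMOUS_DRIVING = "autonomous_driving"


@dataclass
class ObservationConfig:
    # Hypothetical fields mirroring the four observation parameters.
    token_budget: int = 2048            # total visual tokens for the history
    temporal_decay: float = 0.9         # values below 1 favour recent frames
    camera_weights: dict = field(default_factory=lambda: {"front": 1.0})
    frame_sample_mode: str = "uniform"  # assumed options: "uniform", "recent"


@dataclass
class NavCall:
    mode: TaskMode
    goal: str
    obs: ObservationConfig

    def to_payload(self) -> dict:
        """Serialise the call as a plain dict, e.g. for a tool-call message."""
        payload = asdict(self)
        payload["mode"] = self.mode.value
        return payload


# A planner could switch mode and context strategy between sub-tasks:
plan = [
    NavCall(TaskMode.INSTRUCTION_FOLLOWING, "go to the kitchen",
            ObservationConfig(token_budget=4096, temporal_decay=0.98)),
    NavCall(TaskMode.TARGET_TRACKING, "follow the person in red",
            ObservationConfig(token_budget=1024, temporal_decay=0.5,
                              frame_sample_mode="recent")),
]
payloads = [call.to_payload() for call in plan]
```

The point of the design is that both calls target the same weights; only the configuration changes, which is what lets an upper-level planner compose behaviours mid-episode.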
Highlights:
5 Domains, 8 SOTAs: One Model Unifies VLN, ObjNav, Tracking, Driving & EQA
Context = Interface: 4-Axis Observation Protocol, Zero Architecture Change
Agentic: Navigation as a Tool Call + Two-Level Memory, Sets a New Bar on EQA
Generalization: In-the-Wild Single-Camera Deployment in Unseen Environments
Unified Multi-Domain Navigation: a single model with one set of weights achieves state-of-the-art results across 5 navigation domains: 76.5% SR on VLN-CE RxR, 75.6% SR on HM3Dv2 object-goal (RGB only, surpassing depth-based methods), 90.0% tracking rate on EVT-Bench, 91.4 PDMS on NAVSIM, and new bests on 3 EQA benchmarks, with consistent scaling from 2B to 8B parameters.
Controllable Observation Protocol: four axes (visual token budget, temporal decay, per-camera weighting, frame sample mode) are exposed as inference-time parameters and randomized per sample at training time, enabling any inference-time configuration without retraining or architectural modification.
Agentic Navigation System: designed as a reconfigurable navigation primitive within a two-tier system, where an upper-level planner (Qwen3.7-Plus) decomposes long-horizon goals and dispatches configurable navigation calls while maintaining two-level memory, achieving +15.4% on EXPRESS-Bench with 77% fewer navigation steps than the prior best.
In-the-Wild Generalization: deployed zero-shot on a Unitree Go2 quadruped with its single low-resolution built-in camera, demonstrating strong generalization to in-the-wild environments and unconstrained natural-language instructions without any environment-specific fine-tuning.
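To make the observation protocol more concrete, here is a minimal sketch of how a visual token budget might be split across frames and cameras, and how a configuration could be drawn at random per training sample. The exponential weighting rule, the function names, and every numeric range below are assumptions made for illustration; the post does not specify the actual allocation rule.

```python
import random


def allocate_tokens(num_frames, cameras, token_budget, temporal_decay, camera_weights):
    """Split a visual-token budget over (frame, camera) pairs.

    Frame index num_frames - 1 is the most recent. Each pair receives a share
    proportional to camera_weight * temporal_decay ** age (an assumed rule).
    """
    raw = {}
    for t in range(num_frames):
        age = num_frames - 1 - t
        for cam in cameras:
            raw[(t, cam)] = camera_weights.get(cam, 1.0) * temporal_decay ** age
    total = sum(raw.values())
    if total == 0:
        raise ValueError("all weights are zero; nothing to allocate")
    # Floor each share so the allocation never exceeds the budget.
    return {key: int(token_budget * w / total) for key, w in raw.items()}


def sample_training_config(rng):
    """Draw one random protocol setting per training sample (ranges are made up)."""
    return {
        "token_budget": rng.choice([512, 1024, 2048, 4096]),
        "temporal_decay": rng.uniform(0.3, 1.0),
        "camera_weights": {"front": 1.0, "rear": rng.uniform(0.0, 1.0)},
        "frame_sample_mode": rng.choice(["uniform", "recent"]),
    }


# Strong decay concentrates tokens on recent frames (tracking-like behaviour);
# a decay of 1.0 spreads them evenly (long-horizon instruction following).
tracking_alloc = allocate_tokens(4, ["front"], 1000, 0.5, {"front": 1.0})
even_alloc = allocate_tokens(4, ["front"], 1000, 1.0, {"front": 1.0})
cfg = sample_training_config(random.Random(0))
```

Randomizing these settings during training is what would let any of them be chosen at inference time without retraining, as the text describes.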
Benchmarking Qwen-RobotNav
Deployment
Qwen-RobotNav is deployed zero-shot on a Unitree Go2 quadruped (NVIDIA Jetson Thor, 196 ms latency) using only the built-in low-resolution camera. The robot executes step-by-step verbal instructions across multiple rooms in a previously unseen apartment.
Instruction Following
We evaluate a back-and-forth navigation task in an unseen exhibition hall: the robot first navigates 21.78 m from a living room to a hospital room following language instructions, then receives a reverse command and must precisely retrace the entire route. This is particularly challenging as it requires the model to maintain spatial awareness over long distances, ground diverse visual landmarks in both forward and reverse directions, and execute accurate bidirectional position control purely from language.
Cross Embodiment
A single set of weights serves both legged robot navigation and autonomous driving. On NAVSIM closed-loop driving, Qwen-RobotNav-4B achieves 91.4 PDMS.
Read the Qwen-RobotNav blog →
Qwen-RobotManip: The Foundation of Physical Interaction
MANIPULATE
Physical agents need to interact with the real world — for example, completing manipulation tasks with robot arms. Yet an industrial arm on a production line and a service arm in a kitchen may perform visually similar grasping motions while having entirely different joint configurations and action spaces. The core challenge is making heterogeneous embodiments representationally compatible, so that scaling across robots and data sources produces synergy rather than conflict.
Qwen-RobotManip, built on Qwen3.5-4B VL with a flow-matching DiT action head, introduces three mechanisms to solve this. A unified 80-dimensional state-action representation is shared across single-arm, dual-arm, dexterous-hand, and mobile embodiments. Camera-frame end-effector delta pose actions make visually similar motions numerically proximate across robots, abstracting away morphological differences. In-context policy adaptation reads execution history as an implicit embodiment signature for on-the-fly adaptation.
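The claim that camera-frame delta poses make visually similar motions numerically close can be checked with a small numerical sketch. It assumes one plausible definition of the delta (translation and fixed-frame rotation change, both re-expressed in the camera frame); the technical report may define it differently. The slot layout of the 80-dimensional vector is invented here, and only the dimension itself comes from the text.

```python
import numpy as np


def rot_z(theta):
    """Rotation matrix about the z axis (helper for the demo)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def camera_frame_delta(R_cam_base, p0, R0, p1, R1):
    """Delta between two end-effector poses, expressed in the camera frame.

    R_cam_base rotates base-frame vectors into the camera frame;
    (p0, R0) and (p1, R1) are end-effector position and rotation in the base frame.
    """
    dp = R_cam_base @ (p1 - p0)                    # translation delta
    dR = R_cam_base @ (R1 @ R0.T) @ R_cam_base.T   # rotation delta
    return dp, dR


STATE_DIM = 80  # dimension quoted in the text; the layout below is hypothetical


def to_unified(dp, gripper, offset=0):
    """Place a 3-D delta and a gripper scalar into a zero-padded 80-D vector."""
    vec = np.zeros(STATE_DIM)
    vec[offset:offset + 3] = dp
    vec[offset + 3] = gripper
    return vec


# Robot A: base frame coincides with the camera frame.
R_a = np.eye(3)
p0_a, p1_a = np.array([0.30, 0.0, 0.20]), np.array([0.35, 0.0, 0.20])
R0_a, R1_a = np.eye(3), rot_z(0.1)

# Robot B: same motion as seen by the camera, but its base is rotated by 90 degrees.
R_b = rot_z(np.pi / 2)
p0_b, p1_b = R_b.T @ p0_a, R_b.T @ p1_a
R0_b, R1_b = R_b.T @ R0_a, R_b.T @ R1_a

dp_a, dR_a = camera_frame_delta(R_a, p0_a, R0_a, p1_a, R1_a)
dp_b, dR_b = camera_frame_delta(R_b, p0_b, R0_b, p1_b, R1_b)
action_a = to_unified(dp_a, gripper=1.0)
```

In each robot's own base frame the two displacements differ, yet the camera-frame deltas coincide, which is the sense in which the representation abstracts away the embodiment.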
Once the representation framework is unified, the data barrier drops. We train the VLA model on 11,320 hours of open-source robot data, 1,933 hours of open-source egocentric human video, and 24,808 hours of robot demonstrations across 15 embodiments synthesized from the human video via our Human-to-Robot synthesis pipeline, totaling roughly 38,100 hours. Using only open-source data, the model already exhibits emergent generalization capabilities, including robustness to perturbations, zero-shot instruction following, reactive error recovery, and cross-embodiment transfer.
Highlights:
Alignment: Three-Dimensional Alignment (Representation · Motion · Behavior)
Open-Source Data Only: 38K Hours of Manipulation Data Across 15 Embodiments
Dominant OOD Generalization Across All Benchmarks
#1 on RoboChallenge Table30 v1 Generalist Track: Sweeping Top 2, 20% Ahead of 3rd Place
Unified Cross-Embodiment Alignment Framework: a unified 80-dimensional state-action representation accommodates diverse embodiments, camera-frame end-effector delta poses make visually similar motions numerically proximate, and in-context policy adaptation reads execution history as an implicit embodiment identifier, together enabling consistent signal extraction across embodiments.
Human-to-Robot Synthesis at Scale: a pipeline converting 1,933h of egocentric human video into 24,808h of robot demonstrations across 15 embodiments via action retargeting, hand removal and inpainting, simulated rendering, and depth-guided compositing, coupled with a multi-stage curation pipeline ensuring data quality.
OOD Generalization: LIBERO-Plus 91.4% (+7.0 over π0.5), RoboTwin-C2R Hard 69.4% (+21.5 over π0.5), RoboCasa365 Composite-Unseen 14.9% (3× next best), EBench 45.6% (+18.5 over next best); RoboTwin-IF 72.0% (+22.4 over π0.5) confirming genuine language-conditioned control; 3× next best on RoboTwin-XE showing zero-shot cross-embodiment transfer.
Strong Real-World Performance: #1 on RoboChallenge Table30 v1 generalist track with 45% SR, sweeping top 2 and leading 3rd place by 20%; validated on real-robot platforms with 2× prior SOTA on in-domain and OOD tasks, few-shot adaptation, and cross-embodiment skill transfer.
Key Finding: Alignment is the prerequisite for scale. Only models with unified cross-embodiment representations (UnifiedSpace + UnifiedEEF) exhibit clean log-linear data scaling. Without alignment, adding more data produces erratic or flat curves: scale cannot compensate for a broken formulation.
Diverse Real-World Tasks
A single generalist policy handles complex manipulation across diverse task categories, scenes, and objects.
Instruction Following
Following diverse unseen instructions across real-world settings (top row) and simulation (bottom row).
Cross-Embodiment Transfer
Tasks trained on other embodiments transfer zero-shot to new ones (top row); few-shot demonstrations enable rapid adaptation to entirely new tasks (bottom row).
Read the Qwen-RobotManip blog →
Qwen-RobotWorld: Infinite Robotics Worlds
IMAGINE
Real-world experience is the scarcest resource in robotics. Qwen-RobotWorld addresses this by learning the world’s state transition function directly: given the current observation and a natural-language action, it predicts what the world will look like next. The key design choice is expressing all actions in natural language: this converts end-effector poses, steering commands, and navigation waypoints into a single interface, enabling 20+ embodiment types and 500+ action categories to be co-trained on the Embodied World Knowledge corpus (8.6M video-text pairs, 200M+ frames). A 60-layer dual-stream MMDiT couples Qwen2.5-VL’s semantic representations with video latents. Using a full multimodal LLM as the action encoder, rather than a lightweight text encoder, is load-bearing: it brings internalized world knowledge (arms are rigid bodies, fluids spread, objects fall), implicitly constraining generation toward physically plausible futures. Each domain reinforces the others: manipulation teaches contact physics, driving teaches 3D geometry, navigation teaches room-scale spatial reasoning.
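As a rough illustration of what a natural-language action interface involves, the sketch below renders three numerically different action types as plain text, so that a single text-conditioned model could consume all of them. The function names and sentence templates are invented for this example; the post does not state how Qwen-RobotWorld actually phrases actions.

```python
def describe_ee_delta(dx, dy, dz, gripper_open):
    """Render an end-effector displacement (metres) as text."""
    grip = "open the gripper" if gripper_open else "close the gripper"
    return (f"move the end effector {dx * 100:+.1f} cm in x, "
            f"{dy * 100:+.1f} cm in y, {dz * 100:+.1f} cm in z and {grip}")


def describe_steering(angle_deg, speed_mps):
    """Render a driving command as text (positive angle means left, by assumption)."""
    if abs(angle_deg) < 1.0:
        direction = "keep straight"
    else:
        side = "left" if angle_deg > 0 else "right"
        direction = f"steer {side} by {abs(angle_deg):.0f} degrees"
    return f"{direction} at {speed_mps:.1f} m/s"


def describe_waypoint(forward_m, left_m):
    """Render a navigation waypoint (robot-centric, metres) as text."""
    side = "left" if left_m >= 0 else "right"
    return f"walk {forward_m:.1f} m forward and {abs(left_m):.1f} m to the {side}"


# Three embodiments, one interface: every action becomes a string.
actions = [
    describe_ee_delta(0.05, 0.0, -0.02, gripper_open=False),
    describe_steering(-12.0, 5.0),
    describe_waypoint(2.0, -0.5),
]
```

Once actions are strings, adding a new embodiment does not require a new action head, only a way to describe its actions in words.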
Highlights:
Top-Tier Across 4 Benchmarks
20+ Robot Embodiments Unified
8.6M Cross-Scenario Training Pairs
1300+ Manipulation Skills
Language-Driven Unified Action Interface: natural language standardizes 20+ robot embodiments and 500+ action categories into one training interface, enabling manipulation, driving, navigation, and human-to-robot transfer to be jointly trained; each domain reinforces the others.
Dual-Stream MMDiT + Qwen2.5-VL Action Encoder: a full multimodal LLM as action encoder (not a lightweight text encoder) parses complex compositional instructions into precise generation signals with internalized physical world knowledge, serving as a synthetic data engine, closed-loop policy evaluator, and action planner.
Rankings: 1st overall on EWMBench (motion fidelity +33% over runner-up) and DreamGen Bench; 1st open-source on WorldModelBench (perfect physics adherence on Newton’s laws, mass conservation, fluid dynamics) and PBBench.
Capabilities: fine-grained language grounding (changing one keyword → different future); human-to-robot transfer across 8+ embodiments with multi-view consistent generation; zero-shot robustness on RoboTwin-IF.
Any Instruction Following
Changing a single keyword (object, destination, or action verb) produces a correspondingly different future, indicating that generation is conditioned on the instruction’s meaning rather than on surface patterns.
Paired videos, one per instruction:
Different Object: "Pick up red strawberry" ⇄ "Pick up yellow potato"
Different Embodiment: "Assemble camera parts" ⇄ "Assemble camera parts"
Different Destination: "Place pen on wooden tray" ⇄ "Place pen on white paper"
Different Action: "Extend glue forward" ⇄ "Place glue into penholder"
Multi-View Consistent Generation
Given a single instruction, Qwen-RobotWorld generates temporally and spatially consistent videos across multiple camera viewpoints — critical for sim-to-real transfer and multi-camera policy training.
Human → Robot Transfer
Given a human demonstration, Qwen-RobotWorld generates realistic robot execution across diverse embodiments — no teleoperation required.
Embodiments shown, each as a Human Demo → Robot Execution video pair: ARX-L5, xArm7, Franka Panda, Sawyer, Kinova Gen3, Piper, KUKA iiwa, Kinova Jaco.
Autonomous Driving & Indoor Navigation
Driving teaches large-scale 3D geometry and multi-agent dynamics; navigation teaches room-scale spatial reasoning. Each domain reinforces the others.
[Videos: Driving; Indoor Navigation]
Read the Qwen-RobotWorld blog →
From Models to Agents: Closing the Loop
Each model is independently useful, but because all three expose language-first interfaces, general-purpose Qwen models can compose with them as physical-world tools, connecting general intelligence to physical action. Our in-house project Qwen-RobotClaw is a robotics agent harness that lets Qwen VLM agents call Qwen-Robot Suite models as physical-world tools while managing the context and memory that long-horizon tasks require, pushing physical intelligence toward more general and more complex real-world applications. Here are early examples of what this makes possible.
Open-Ended Task Execution
Qwen-Omni observes the scene, randomly proposes manipulation tasks via speech, and judges execution in real time. Each video shows Qwen-RobotManip completing tasks on the fly with no pre-defined task list — demonstrating that a general-purpose multimodal model can serve as the task proposer and evaluator, while the suite model handles physical execution.
Long-Horizon Manipulation
We develop a VLM-driven agentic VLA system in which the base Qwen3.5 model serves as the high-level planner and Qwen-RobotManip handles low-level execution. Leveraging its capabilities in scene understanding, spatial reasoning, and task progress assessment, Qwen3.5 decomposes a complex high-level instruction into a sequence of atomic subtasks, which are then executed by the VLA. This division of labor significantly improves robustness to OOD scenes and instructions.
We illustrate this with a table-cleaning task on a cluttered tabletop with a basket. Given this fully OOD scene and an abstract instruction, the VLA model used on its own behaves abnormally (right). With Qwen3.5 as the planner, the system decomposes the task into fine-grained atomic subtasks in real time (left), allowing the VLA to focus on one simple step at a time and demonstrating compositional generalization.
We also observe that subtask decomposition helps the system recover from failure-and-retry loops. When the high-level VLM detects that execution has stalled, it replans by issuing a new subtask, enabling the system to resume progress and ultimately complete the task successfully.
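The planner-executor division of labor, including replanning when execution stalls, can be sketched as a simple control loop. This is a hedged illustration, not the system described here: `run_agentic_loop`, the `plan` and `execute` callables, and the retry limits are hypothetical stand-ins for the VLM planner and the VLA executor, and stall detection is reduced to "the subtask failed on every retry".

```python
from typing import Callable, List


def run_agentic_loop(
    instruction: str,
    plan: Callable[[str, List[str]], List[str]],
    execute: Callable[[str], bool],
    max_retries: int = 2,
    max_replans: int = 3,
) -> List[str]:
    """Run subtasks from `plan`; replan when a subtask keeps failing.

    `plan(instruction, done)` returns the remaining subtasks given what is done.
    `execute(subtask)` returns True on success.
    Returns the list of completed subtasks.
    """
    done: List[str] = []
    replans = 0
    queue = plan(instruction, done)
    while queue:
        subtask = queue[0]
        # any() stops at the first successful attempt.
        succeeded = any(execute(subtask) for _ in range(max_retries))
        if succeeded:
            done.append(subtask)
            queue.pop(0)
        elif replans < max_replans:
            replans += 1
            queue = plan(instruction, done)  # stalled: ask the planner again
        else:
            break  # give up rather than loop forever
    return done


def make_demo():
    """Stub planner and executor standing in for the VLM and the VLA."""
    calls = {"n": 0}

    def plan(instruction, done):
        calls["n"] += 1
        if calls["n"] == 1:
            return ["clear everything at once"]   # too coarse for the executor
        steps = ["put cup in basket", "put spoon in basket"]
        return [s for s in steps if s not in done]

    def execute(subtask):
        return subtask.startswith("put ")         # stub: only atomic steps work

    return plan, execute


plan, execute = make_demo()
completed = run_agentic_loop("clean the table", plan, execute)
```

In the stub run, the first coarse subtask stalls, the planner is asked again, and the finer-grained subtasks then succeed one at a time.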
[Videos: Agent + VLA (success); VLA alone (failure)]
Agentic Navigation & Embodied QA
By combining the agent system with Qwen-RobotNav, we achieve a substantial improvement over previous state of the art on long-horizon 3D physical-world exploration tasks, including Embodied Question Answering benchmarks such as HM-EQA, MT-HM3D, and EXPRESS-Bench. We can also deploy the same open-world exploration capability in real environments, as shown in the demos below. We will release more technical details in future updates.
In the first demo, the user asks the agent to find an open restroom in a real building. The agent scans the environment, follows corridor-level cues to search for restroom signage, discovers that the first restroom is unavailable due to a visible "Cleaning in Progress / 暂停使用" sign, and then replans to search for an alternative on the other side of the building. After verifying from visual evidence that the second restroom is open and accessible, it returns an evidence-grounded answer.
In the second demo, which was previously released together with Qwen3.7-Max, the agent calls Qwen-RobotNav to autonomously explore an open campus environment, correct its deviations along the way, and finally recover a lost umbrella.
Chat2Robot
We provide an experimental feature — Chat2Robot — where you can chat with a robot directly in your browser. Simply type a natural-language instruction, and watch the robot respond in real time. Give it a try and experience Qwen-Robot Suite in action!
Note: Chat2Robot currently supports Qwen-RobotManip only. The deployed policy is trained solely on the RoboTwin-Clean dataset, which contains only 50 tasks — it is not a perfect policy. Our goal here is to demonstrate a degree of zero-shot instruction following capability. This feature is still under active development and may not be fully polished — we welcome your feedback and suggestions!
We thank D-Robotics (Digua) for supporting this feature.
What’s Next
Physical world intelligence is still in its infancy. Contact-rich long-horizon tasks, lifelong learning, tighter integration between general-purpose planners and physical-world executors, and richer human-robot-environment interaction all remain open. But the path is becoming clear: start from strong multimodal understanding, bridge the vision-language representation space into each type of physical action, scale the training, and demand generalization.
A physical agent that can go anywhere, do anything, and foresee what comes next.
That is the destination — and the Qwen-Robot Suite is our first full step towards it.
Citation
@article{qwenrobotnav,
  title={Qwen-RobotNav: A Scalable Navigation Model Designed for an Agentic Navigation System},
  author={Qwen Team},
  year={2026}
}
@article{qwenrobotmanip,
  title={Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models},
  author={Qwen Team},
  year={2026}
}
@article{qwenrobotworld,
  title={Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation},
  author={Qwen Team},
  year={2026}
}