Jim Fan@DrJimFan · 2月5日New milestone: we trained a robot foundation model on a world model backbone, and enabled zero-shot, open-world prompting capability for new verbs, nouns, and environments. If the world model can "dream" the right future in pixels, then the robot can execute well in motors. We call it "DreamZero", our first World Action Model (WAM).
Our team had tons of fun at the lab typing anything we like into an open text prompt, and watch the robot perform tasks it was never trained on. An emergent capability we didn't quite expect. Obviously not GPT-3 reliable yet, but we are marching into the GPT-2 era.
Discoveries:
- Model and data recipe co-evolve. Compared to VLAs, WAMs learn best from diverse data, breaking away from the conventional wisdom that lots of repeated demos per task are the bread and butter. Diversity >> repetitions.
- X-embodiment is extremely hard. Pixels are the answer. Different robot morphologies traditionally have a hard time sharing knowledge well. But if we put video first, pixels become the universal bridge connecting different hardware - even videos of human first-person view.
DreamZero shows significant robot2robot and human2robot transfer. With only 55 trajectories on a *new*, unseen hardware (~30 min of teleop), it adapts so quickly and retains zero-shot prompting ability.
Yesterday I posted about the "Second Pre-training Paradigm": world models are the next-gen foundation of Physical AI, not language backbones.
Today, we are proving it works. And 2026 has just begun.
Paper: World Action Models are Zero-Shot Policies.
Read it now: (thread)
译团队发布DreamZero,首个基于世界模型骨干的World Action Model (WAM)。该模型突破传统Vision-Language-Action范式,通过像素级世界模型实现零样本开放世界提示能力,可执行未训练过的新任务。研究发现WAM依赖多样化数据而非重复演示,并以像素作为跨具身的通用桥梁,实现robot2robot和human2robot知识迁移。仅需55条轨迹(约30分钟遥操作)即可适应全新硬件,验证世界模型作为Physical AI下一代基础的可行性。