Jim Fan@DrJimFan · 2月25日What can half of GPT-1 do? We trained a 42M transformer called SONIC to control the body of a humanoid robot. It takes a remarkable amount of subconscious processing for us humans to squat, turn, crawl, sprint. SONIC captures this "System 1" - the fast, reactive whole-body intelligence - in a single model that translates any motion command into stable, natural motor signals. And it's all open-source!!
The key insight: motion tracking is the one, true scalable task for whole body control. Instead of hand-engineering rewards for every new skill, we use dense, frame-by-frame supervision from human mocap data. The data itself encodes the reward function: "configure your limbs in any human-like position while maintaining balance".
We scaled humanoid motion RL to an unprecedented scale: 100M+ mocap frames and 500,000+ parallel robots across 128 GPUs. NVIDIA Isaac Lab allows us to accelerate physics at 10,000x faster tick, giving robots many years of virtual experience in only hours of wall clock time. After 3 days of training, the neural net transfers zero-shot to the real G1 robot with no finetuning. 100% success rate across 50 diverse real-world motion sequences.
One SONIC policy supports all of the following:
- VR whole-body teleoperation
- Human video. Just point a webcam to live stream motions.
- Text prompts. "Walk sideways", "dance like a monkey", "kick your left foot", etc.
- Music audio. The robot dances to the beat, adapting to tempo and rhythm.
- VLA foundation models. We plugged in GR00T N1.5 and achieved 95% success on mobile tasks.
We open-source the code and model checkpoints!! Deep dive in thread:
译SONIC是一个4200万参数的Transformer模型(规模仅半个GPT-1),通过1亿+动作捕捉帧和50万+并行机器人在NVIDIA Isaac Lab中训练,以密集帧级监督替代手工奖励函数。训练3天后零样本迁移至真实G1机器人,在50种动作序列上达100%成功率。单一策略支持VR遥操作、视频动捕、文本指令、音乐响应及VLA模型控制。项目已完全开源。