Nvidia bets big on physical AI at GTC Taipei with a new world model, driving brain, and open humanoid robot
Nvidia used GTC Taipei to launch a series of models for robots, autonomous vehicles, and video systems. The centerpieces are the new world model Cosmos 3, a significantly scaled-up driving model called Alpamayo 2 Super, and an open reference platform for humanoid robots.
Cosmos 3 is the next version of Nvidia's open "omnimodel," which processes text, images, video, ambient audio, and action data in a single system. Developers building robots, autonomous vehicles, and video surveillance systems can use it to generate synthetic training data, interpret scenes, and predict future world states without having to painstakingly recreate those situations in the real world.
Nvidia names three use cases. As a vision-language model, Cosmos 3 analyzes video, for example to detect traffic anomalies in smart cities, as partner Linker Vision is already doing.
As a world model, it generates photorealistic video sequences of rare situations like near-misses or unusual object arrangements in a warehouse.
And as the basis for so-called world-action models, it produces numerical motion data like joint angles or gripper positions that robots use to learn tasks such as picking and placing, as industrial partner Agile Robots demonstrates.
The architecture uses a mixture-of-transformers approach: a reasoning transformer first analyzes a scene, and a separate generation transformer then produces videos, descriptions, or motion trajectories from that analysis. Training data included billions of examples spanning text, images, video, audio, and action data. Nvidia offers three variants: Cosmos 3 Super delivers the highest quality currently available, Nano is built for fast inference, and a forthcoming Edge model targets real-time operation on embedded systems. The models are available under the OpenMDW-1.1 license on Hugging Face and GitHub.
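To make the two-stage flow described above more concrete, here is a minimal Python sketch of a pipeline in which an analysis step feeds a generation step. It is only an illustration of the data flow, not Nvidia's implementation or the Cosmos API: the names `SceneAnalysis`, `reasoning_stage`, `generation_stage`, and `run_pipeline` are hypothetical, and both stages are simple stand-ins rather than transformers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SceneAnalysis:
    """Hypothetical intermediate result handed from stage one to stage two."""
    summary: str
    objects: List[str]


def reasoning_stage(frame_labels: List[str]) -> SceneAnalysis:
    """Stand-in for the reasoning transformer: condenses raw input into an analysis."""
    objects = sorted(set(frame_labels))  # deduplicate and order the observed objects
    return SceneAnalysis(summary=f"{len(objects)} distinct objects", objects=objects)


def generation_stage(analysis: SceneAnalysis, output_kind: str) -> str:
    """Stand-in for the generation transformer: produces an output conditioned on the analysis."""
    # The article names three output types; anything else is rejected here.
    if output_kind not in {"video", "description", "trajectory"}:
        raise ValueError(f"unsupported output kind: {output_kind}")
    return f"{output_kind} conditioned on: {analysis.summary} ({', '.join(analysis.objects)})"


def run_pipeline(frame_labels: List[str], output_kind: str) -> str:
    """Run the analysis step first, then generate from its result."""
    return generation_stage(reasoning_stage(frame_labels), output_kind)


if __name__ == "__main__":
    print(run_pipeline(["pallet", "forklift", "pallet"], "description"))
```

The point of the sketch is the separation of concerns: the generation step never sees the raw input, only the analysis, so the same analysis could in principle drive any of the three output types.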
The release is accompanied by the "Cosmos Coalition," a partner group that includes Black Forest Labs, Runway, LTX, Generalist, Agile Robots, and Skild AI. In practice, it is an alliance whose members use Nvidia's DGX Cloud training infrastructure and contribute models and data in return.