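Nvidia has released Cosmos 3, an omnimodal world model that integrates language, image, video, audio, and action into a single system, so that physical AI can span three tasks: understanding, simulation, and action. It treats action as a first-class language of the world: through its action-token design, the model can infer actions from video, or generate a future scene together with the corresponding motion. This moves robots from "recognizing objects" to predicting the consequences of interactions such as moving, grasping, and sliding. The accompanying paper, "Cosmos 3: Omnimodal World Models for Physical AI", has been posted on arXiv.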
Nvidia's Cosmos 3: one model that can understand, simulate, and act across many physical AI tasks.
It treats action as a first-class language of the world.
Most AI models look at reality from the outside: images become captions, videos become descriptions, and motion becomes something to label after the fact.
Cosmos 3 tries to collapse that distance by putting language, image, video, audio, and action into one shared system, so a robot can connect what it sees with what might happen next and what it should do.
For a home robot, recognizing a plate, a table, and a human instruction is not enough; the useful question is what changes when it moves, grasps, slips, bumps, or waits.