Rohan Paul @rohanpaul_ai

2026-06-13 22:06 · 19 days ago

AI Summary

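Nvidia has released Cosmos 3, an omnimodal world model that integrates language, image, video, audio, and action into a single system, so that physical AI can span three tasks: understanding, simulation, and action. It treats action as a first-class language of the world: through its action-token design, the model can infer actions from video or generate a future scene together with the corresponding motion. This moves robots from "recognizing objects" to predicting the consequences of interactions such as moving, grasping, and slipping. The accompanying paper, "Cosmos 3: Omnimodal World Models for Physical AI," has been posted on arXiv.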

Nvidia's Cosmos 3: 1 model that can understand, simulate, and act across many physical AI tasks.

It treats action as a first-class language of the world.

Most AI models look at reality from the outside: images become captions, videos become descriptions, and motion becomes something to label after the fact.

Cosmos 3 tries to collapse that distance by putting language, image, video, audio, and action into one shared system, so a robot can connect what it sees with what might happen next and what it should do.

For a home robot, recognizing a plate, a table, and a human instruction is not enough, because the useful question is what changes when it moves, grasps, slips, bumps, or waits.

That is why the paper's action-token design matters: it turns movement into something the model can condition on, infer from video, or generate alongside a future scene.

----

Link - arxiv.org/abs/2606.02800

Title: "Cosmos 3: Omnimodal World Models for Physical AI"

arXiv · Embodied AI · Multimodal · Model release
