# Nvidia Releases Cosmos 3: An Omnimodal World Model That Lets Physical AI Understand, Simulate, and Act

- Source: Rohan Paul (@rohanpaul_ai)
- Published: 2026-06-13 22:06
- AIHOT score: 68
- AIHOT link: https://aihot.virxact.com/items/cmqcfk1xw02bqsl9ikhf0fikq
- Original link: https://x.com/rohanpaul_ai/status/2065797839507312877

## AI Summary

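Nvidia has released Cosmos 3, an omnimodal world model that integrates language, image, video, audio, and action into a single system, letting physical AI span three kinds of task: understanding, simulation, and action. It treats action as a first-class language of the world: through its action-token design, the model can infer actions from video or generate a future scene together with the corresponding motion. This moves robots from "recognizing objects" to predicting the consequences of interactions such as moving, grasping, and slipping. The accompanying paper, "Cosmos 3: Omnimodal World Models for Physical AI", has been posted on arXiv.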

## Full Text

Nvidia's Cosmos 3: one model that can understand, simulate, and act across many physical AI tasks.

It treats action as a first-class language of the world.

Most AI models look at reality from the outside: images become captions, videos become descriptions, and motion becomes something to label after the fact.

Cosmos 3 tries to collapse that distance by putting language, image, video, audio, and action into one shared system, so a robot can connect what it sees with what might happen next and what it should do.

A home robot cannot simply recognize a plate, a table, and a human instruction, because the useful question is what changes when it moves, grasps, slips, bumps, or waits.

That is why the paper's action-token design matters: it turns movement into something the model can condition on, infer from video, or generate alongside a future scene.
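To make the idea of an "action token" concrete, here is a minimal sketch of one simple way continuous robot actions could be turned into discrete tokens and placed in the same sequence as video tokens. It is an illustration only, not the paper's method: the uniform-binning scheme, the bin count, the action range, and every name (`NUM_BINS`, `action_to_tokens`, `tokens_to_action`, `interleave`) are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

# Hypothetical settings, chosen for illustration only.
NUM_BINS = 256          # number of discrete tokens per action dimension
LOW, HIGH = -1.0, 1.0   # assumed normalized range of each action dimension


def action_to_tokens(action: Sequence[float]) -> List[int]:
    """Map each continuous action dimension to an integer bin index."""
    tokens = []
    for value in action:
        clipped = min(max(value, LOW), HIGH)           # keep value in range
        fraction = (clipped - LOW) / (HIGH - LOW)      # rescale to [0, 1]
        tokens.append(min(int(fraction * NUM_BINS), NUM_BINS - 1))
    return tokens


def tokens_to_action(tokens: Sequence[int]) -> List[float]:
    """Approximate the original action by the centre of each bin."""
    width = (HIGH - LOW) / NUM_BINS
    return [LOW + (t + 0.5) * width for t in tokens]


def interleave(
    frames: Sequence[Sequence[int]],
    actions: Sequence[Sequence[int]],
) -> List[Tuple[str, int]]:
    """Build one sequence: frame 0, action 0, frame 1, action 1, ...

    Each entry is tagged with its modality so a single model could, in
    principle, be asked to predict either the next action or the next frame.
    """
    sequence: List[Tuple[str, int]] = []
    for i, frame in enumerate(frames):
        sequence.extend(("vid", t) for t in frame)
        if i < len(actions):
            sequence.extend(("act", t) for t in actions[i])
    return sequence
```

Under these assumptions, `action_to_tokens([-1.0, 0.0, 1.0])` gives one token per action dimension, and `tokens_to_action` recovers each value to within half a bin width. Once actions live in the same token sequence as video, "infer the action from video" and "generate the future scene alongside the motion" both reduce to predicting the missing tokens in that sequence.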

----

Link - arxiv.org/abs/2606.02800

Title: "Cosmos 3: Omnimodal World Models for Physical AI"