# NVIDIA Releases Cosmos Reason 2, Boosting Reasoning for Physical AI

- Source: Hugging Face: Blog (RSS)
- Published: 2026-01-06 06:56
- AIHOT score: 80
- AIHOT tag: Featured
- AIHOT link: https://aihot.virxact.com/items/cmoegbhak009xslxxqlif4h4n
- Original article: https://huggingface.co/blog/nvidia/nvidia-cosmos-reason-2-brings-advanced-reasoning

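## Why It's Featured

An upgrade to physical AI reasoning, and a key piece of the puzzle for bringing robots and embodied intelligence into real-world use.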

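## AI Summary

NVIDIA has released Cosmos Reason 2 on Hugging Face. It is an open reasoning vision-language model (VLM) for physical AI, designed to help robots and AI agents understand how objects move and interact in the physical world and to solve complex tasks step by step. Compared with Cosmos Reason 1, the key upgrades are:

- improved spatio-temporal understanding and timestamp precision;
- 2B and 8B model sizes for edge-to-cloud deployment;
- 2D/3D point localization, bounding boxes, trajectory output, and OCR;
- a context window expanded from 16K to 256K input tokens.

These advances target robotics, autonomous driving, and video analytics, with the aim of making AI decisions in real-world environments more reliable.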

## Full Text

NVIDIA Cosmos Reason 2 Brings Advanced Reasoning To Physical AI

Published January 5, 2026

Authors: Tsung-Yi Lin (tsungyi, NVIDIA), Debraj Sinha (debrajsinha, NVIDIA)

NVIDIA today released Cosmos Reason 2, the latest advance in open reasoning vision-language models for physical AI. Cosmos Reason 2 surpasses its predecessor in accuracy. It tops the Physical AI Bench and Physical Reasoning leaderboards as the #1 open model for visual understanding.

### NVIDIA Cosmos Reason 2: Reasoning Vision-Language Model for Physical AI

Since their introduction, vision-language models have rapidly improved at tasks like object and pattern recognition in images. But they still struggle with tasks humans find natural, like planning several steps ahead, dealing with uncertainty or adapting to new situations. Cosmos Reason is designed to close this gap by giving robots and AI agents stronger common sense and reasoning to solve complex problems step by step.

Cosmos Reason 2 is a state-of-the-art, open reasoning vision-language model (VLM) that enables robots and AI agents to see, understand, plan, and act in the physical world like humans. It draws on common sense, physics, and prior knowledge to understand how objects move through space and time. This lets it handle complex tasks, adapt to new situations, and work out solutions step by step.

### ✨ Key Highlights

- Improved spatio-temporal understanding and timestamp precision.
- Optimized performance with flexible deployment from edge to cloud, in 2B- and 8B-parameter model sizes.
- Support for an expanded set of spatial understanding and visual perception capabilities: 2D/3D point localization, bounding box coordinates, trajectory data, and OCR.
- Improved long-context understanding with 256K input tokens, up from 16K in Cosmos Reason 1.
- Adaptable to multiple use cases with easy-to-use Cosmos Cookbook recipes.
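
Bounding-box and point outputs like those listed above usually need to be mapped back onto the source frame before they can be drawn or passed downstream. The sketch below assumes, purely for illustration, that the model returns `[x1, y1, x2, y2]` boxes normalized to a 0 to 1000 grid. That convention, the function name `to_pixel_box`, and the `norm_range` parameter are all assumptions and not a documented Cosmos Reason 2 API, so check the model card for the actual output format.

```python
def to_pixel_box(box, width, height, norm_range=1000):
    """Convert a normalized [x1, y1, x2, y2] box to integer pixel coordinates.

    ASSUMPTION (hypothetical convention): coordinates are normalized to the
    range 0..norm_range. If the model already returns absolute pixels,
    this scaling step is not needed.
    """
    if len(box) != 4:
        raise ValueError(f"expected 4 values, got {box!r}")
    x1, y1, x2, y2 = box
    if x1 > x2 or y1 > y2:
        raise ValueError(f"malformed box (min > max): {box!r}")

    def scale(value, size):
        # Rescale from the normalized grid to pixels, then clamp to the image.
        pixel = round(value / norm_range * size)
        return max(0, min(size, pixel))

    return (scale(x1, width), scale(y1, height), scale(x2, width), scale(y2, height))


# Example: a box covering the lower-middle part of a 640x480 frame.
print(to_pixel_box([250, 500, 750, 1000], 640, 480))  # (160, 240, 480, 480)
```

Clamping keeps slightly out-of-range predictions inside the frame instead of rejecting them. A stricter pipeline might flag such boxes instead.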

### 🤖 Popular Use Cases

Video analytics AI agents: These agents can extract valuable insights from massive volumes of video data to optimize processes. Cosmos Reason 2 builds on the capabilities of Cosmos Reason 1 and adds OCR support, 2D/3D point localization, and Set-of-Mark understanding.

Example of how Cosmos Reason can understand text embedded within a video to determine the condition of the road during a rainstorm.

Developers can jumpstart development of video analytics AI agents by using the NVIDIA blueprint for video search and summarization (VSS) with Cosmos Reason as the VLM.

Salesforce is transforming workplace safety and compliance by analyzing video footage captured by Cobalt robots. It uses Agentforce and the VSS blueprint, with Cosmos Reason as the VLM.

Data annotation and critique: Cosmos Reason enables developers to automate high-quality annotation and critique of massive, diverse training datasets. It provides timestamps and detailed descriptions for real or synthetically generated training videos.

Example of a sample prompt to generate detailed, time-stamped captions for a race car video.

Uber is exploring Cosmos Reason 2 to deliver accurate, searchable video captions for autonomous vehicle (AV) training data, enabling efficient identification of critical driving scenarios. The co-authored "Reason 2 for AV Video Captioning and VQA" recipe demonstrates how to fine-tune and evaluate Cosmos Reason 2-8B on annotated AV videos. Fine-tuning produced measurable improvements across multiple evaluation metrics:

- The BLEU score rose 10.6% (0.113 → 0.125).
- MCQ-based VQA accuracy gained 0.67 percentage points (80.18% → 80.85%).
- LingoQA rose 13.8 percentage points (63.2% → 77.0%).

These gains demonstrate effective domain adaptation for AV applications.
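
The figures above mix two kinds of change. BLEU is reported as a relative improvement, while the MCQ-based VQA and LingoQA results are accuracies, where the natural unit is the percentage-point difference. The small sketch below recomputes each figure from the quoted before/after values; the helper names are illustrative. It shows that the LingoQA jump from 63.2% to 77.0% is 13.8 percentage points, or roughly 21.8% in relative terms.

```python
def relative_change_pct(before: float, after: float) -> float:
    """Relative change in percent: (after - before) / before * 100."""
    return (after - before) / before * 100.0


def point_change(before_pct: float, after_pct: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return after_pct - before_pct


# Before/after values quoted for the AV captioning and VQA recipe.
print(f"BLEU:    {relative_change_pct(0.113, 0.125):.1f}% relative")
print(f"MCQ VQA: {point_change(80.18, 80.85):.2f} percentage points")
print(f"LingoQA: {point_change(63.2, 77.0):.1f} percentage points "
      f"({relative_change_pct(63.2, 77.0):.1f}% relative)")
```

Reporting accuracy deltas in percentage points avoids the ambiguity of a bare "%", which readers may interpret as either a relative or an absolute change.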

Robot planning and reasoning: Cosmos Reason 2 can act as the brain for deliberate, methodical decision-making in a robot vision-language-action (VLA) model. It now provides trajectory coordinates in addition to determining next steps.

Example of the prompt and JSON output from Cosmos Reason 2 to provide the steps and trajectory the robot gripper needs to take to move the painter’s tape into the basket.
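
When a model returns a plan as JSON, the caller usually has to validate it before sending anything to a robot controller. The following is a minimal sketch that assumes a hypothetical response shape: a `"steps"` list of strings and a `"trajectory"` list of `[x, y]` waypoints. The key names, the `RobotPlan` class, and `parse_plan` are illustrative assumptions, not the actual Cosmos Reason 2 output schema.

```python
import json
import math
from dataclasses import dataclass


@dataclass
class RobotPlan:
    steps: list[str]
    trajectory: list[tuple[float, float]]

    def path_length(self) -> float:
        """Sum of straight-line distances between consecutive waypoints."""
        return sum(math.dist(a, b) for a, b in zip(self.trajectory, self.trajectory[1:]))


def parse_plan(raw: str) -> RobotPlan:
    """Parse a model response into a RobotPlan.

    ASSUMPTION: the response is a JSON object with hypothetical keys
    "steps" (list of strings) and "trajectory" (list of [x, y] points).
    Missing keys raise KeyError; adapt the keys to the real schema.
    """
    data = json.loads(raw)
    steps = [str(s) for s in data["steps"]]
    points = []
    for p in data["trajectory"]:
        if len(p) != 2:
            raise ValueError(f"expected [x, y] point, got {p!r}")
        points.append((float(p[0]), float(p[1])))
    return RobotPlan(steps, points)


raw = ('{"steps": ["Grasp the tape", "Move above the basket", "Release"], '
       '"trajectory": [[0, 0], [3, 4], [3, 10]]}')
plan = parse_plan(raw)
print(plan.steps[0], plan.path_length())  # Grasp the tape 11.0
```

Rejecting malformed waypoints up front is cheaper than discovering them mid-motion. A production pipeline would also bounds-check coordinates against the robot's workspace.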

Encord provides native support for Cosmos Reason 2 in its Data Agent library and AI data platform, enabling developers to leverage Cosmos Reason 2 as a VLA for robotics and other physical AI use cases.

Companies like Hitachi, Milestone and VAST Data are using Cosmos Reason to advance robotics, autonomous driving, and video analytics AI agents for traffic and workplace safety.

Try Cosmos Reason 2 on build.nvidia.com and experience the latest features with sample prompts for generating bounding boxes and robot trajectories. Upload your own videos and images for further analysis.

Download the Cosmos Reason 2 models (2B and 8B) from Hugging Face, or use Cosmos Reason 2 in the cloud. The models will soon be available on Amazon Web Services, Google Cloud, and Microsoft Azure. To get started, check out the Cosmos Reason 2 documentation and the Cosmos Cookbook.

### Other Models from the Cosmos Family

#### 🔮 Cosmos Predict 2.5

Cosmos Predict is a generative AI model that predicts future states of the physical world as video, based on text, image, or video inputs.

- Physical AI Bench leader for quality, accuracy, and overall consistency.
- Generates up to 30 seconds of physically and temporally consistent video per generation.
- Supports multiple frame rates and resolutions.
- Pre-trained on 200 million clips.
- Available as 2B and 14B pre-trained models, plus various 2B post-trained models for multiview generation, action conditioning, and autonomous vehicle training.

Check out the model card.

#### 🔁 Cosmos Transfer 2.5

Cosmos Transfer is our lightest multicontrol model, built for video-to-world style transfer.

- Scale a single simulation or spatial video across various environments and lighting conditions.
- Improved prompt adherence and physics alignment.
- Use with NVIDIA Isaac Sim™ or NVIDIA Omniverse NuRec for simulation-to-real transformation.

Check out the model card.

#### 🤖 NVIDIA GR00T N1.6

NVIDIA GR00T N1.6 is an open reasoning vision-language-action (VLA) model purpose-built for humanoid robots. It unlocks full-body control and uses NVIDIA Cosmos Reason for better reasoning and contextual understanding.

### Resources

- ▶️ Watch a demo of Cosmos → https://youtu.be/iWs-2TD5Dcc
- 🧑🏻‍🍳 Read the Cosmos Cookbook → https://nvda.ws/4qevli8
- 📚 Explore Models & Datasets → https://github.com/nvidia-cosmos
- ⬇️ Try Cosmos Models in our Hosted Catalog → https://nvda.ws/3Yg0Dcx
- 💻 Join the Cosmos Community → https://discord.gg/u23rXTHSC9
- 🗳️ Contribute to the Cosmos Cookbook → https://nvda.ws/4aQcBkk


### Community

merve (Jan 6): very bullish on embodied VLA/VLMs this year 🔥

mindchain (Jan 6): looks so cool!


Daniel6316 (Jan 12): Great update from NVIDIA: Cosmos Reason 2 looks like a big step forward for physical AI, especially with its improved reasoning and decision-making capabilities. It'll be exciting to see how this advances real-world robotics and autonomous systems.

