With Nemotron 3 Nano Omni, Nvidia reveals what really goes into a modern multimodal model

2026-04-29 17:28 · Maximilian Schreiner

Key Points

Nvidia has released Nemotron 3 Nano Omni, an open AI model that processes text, images, video, and audio and is built for agentic applications.

Training involved 717 billion tokens. Much of the synthetic training data comes from competing models like Qwen, gpt-oss, and DeepSeek-OCR.

Along with the model weights, Nvidia is also releasing parts of the training data and pipelines. The model is cleared for commercial use.

Nvidia has released Nemotron 3 Nano Omni, an open multimodal model that handles text, images, video, and audio. The interesting part isn't just the performance; it's the training data, which draws on models like Qwen, gpt-oss, Kimi, and DeepSeek-OCR.

Nemotron 3 Nano Omni is an open-source multimodal model that processes text, images, video, and audio in a single architecture. The 30-billion-parameter model uses a Mamba-Transformer hybrid with Mixture-of-Experts, activating about three billion parameters per token. It uses Nvidia's own C-RADIOv4-H vision encoder and the Parakeet-TDT audio encoder and supports a context window of up to 256,000 tokens. The only officially supported language is English.
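
To make the "30 billion total, about three billion active" figure concrete, here is a minimal sketch of how top-k routing in a Mixture-of-Experts layer leaves most parameters untouched for any given token. It is a toy illustration, not Nvidia's implementation: the function names, the expert count, the value of k, and the parameter sizes are all hypothetical, chosen only so the active share lands near the roughly 10 percent implied by the article's numbers.

```python
import math


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and return (index, weight) pairs.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def active_fraction(num_experts, k, expert_params, shared_params):
    """Share of all parameters used when one token is routed to k experts."""
    total = shared_params + num_experts * expert_params
    active = shared_params + k * expert_params
    return active / total


# Ratio implied by the article: about 3 billion active out of 30 billion.
article_ratio = 3e9 / 30e9

# Hypothetical toy layer: 16 experts, 1 chosen per token (assumed values).
toy_ratio = active_fraction(num_experts=16, k=1,
                            expert_params=100, shared_params=70)

# Route one token using made-up router scores for four experts.
chosen = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)

if __name__ == "__main__":
    print(f"article ratio: {article_ratio:.3f}")
    print(f"toy layer ratio: {toy_ratio:.3f}")
    print("chosen experts:", [index for index, _ in chosen])
```

The point of the sketch is only the bookkeeping: compute cost per token scales with the shared parameters plus the selected experts, while the remaining experts still have to be held in memory.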

According to the technical report, Nemotron 3 Nano Omni is built mainly for agentic applications: document processing, computer-use agents, video and audio analysis, and voice interaction. On benchmarks like OCRBenchV2, MMLongBench-Doc, WorldSense, and VoiceBench, the model beats its predecessor, Nemotron Nano V2 VL, and goes toe-to-toe with Alibaba's Qwen3-Omni. On OSWorld, a benchmark for GUI agents, accuracy jumps from 11.1 to 47.4 points compared to the previous version. Nvidia says throughput at the same interactivity level is up to nine times higher than Qwen3-Omni.

How rival models shaped the training data

The benchmarks are one thing, but there are also interesting details about the training data, the kind of detail you only get with a true open-source release. Nvidia processed roughly 717 billion tokens across seven training stages, with the context window expanding at each step.

The Decoder: AI News (RSS)
