With Nemotron 3 Nano Omni, Nvidia reveals what really goes into a modern multimodal model
Key Points
Nvidia has released Nemotron 3 Nano Omni, an open AI model that processes text, images, video, and audio and is built for agentic applications.
Training involved 717 billion tokens. Much of the synthetic training data comes from competing models like Qwen, gpt-oss, and DeepSeek-OCR.
Along with the model weights, Nvidia is also releasing parts of the training data and pipelines. The model is cleared for commercial use.
Nvidia has released Nemotron 3 Nano Omni, an open multimodal model that handles text, images, video, and audio. The interesting part isn't just the performance but the training data, which draws on models like Qwen, gpt-oss, Kimi, and DeepSeek-OCR.
Nemotron 3 Nano Omni is an open-source multimodal model that processes text, images, video, and audio in a single architecture. The 30-billion-parameter model uses a Mamba-Transformer hybrid with Mixture-of-Experts, activating only about three billion parameters per token. It uses Nvidia's own C-RADIOv4-H vision encoder and the Parakeet-TDT audio encoder, and supports a context window of up to 256,000 tokens. The only officially supported language is English.
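To make the gap between total and active parameters concrete, here is a minimal sketch of top-k expert routing, the general idea behind Mixture-of-Experts. Everything in it is an assumption for illustration: the function names, the expert count, the top-k value, and the parameter sizes are toy values, not Nemotron 3 Nano Omni's actual configuration, and the technical report's router may work differently.

```python
import math
import random


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    # Softmax over the selected experts only, so their weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def active_fraction(num_experts, k, expert_params, shared_params):
    """Fraction of all parameters used when only k of num_experts experts run."""
    total = shared_params + num_experts * expert_params
    active = shared_params + k * expert_params
    return active / total


# Hypothetical toy numbers, NOT the real model configuration.
NUM_EXPERTS = 32
TOP_K = 2

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router scores

routes = top_k_route(logits, TOP_K)
frac = active_fraction(NUM_EXPERTS, TOP_K,
                       expert_params=1_000_000, shared_params=1_500_000)

print("selected experts and weights:", routes)
print(f"share of parameters active for this token: {frac:.3f}")
```

With these toy values, only the shared layers plus two of the 32 experts run for a given token, which is how a sparse model can hold far more parameters than it uses for any single step.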
According to the technical report, Nemotron 3 Nano Omni is built mainly for agentic applications: document processing, computer-use agents, video and audio analysis, and voice interaction. On benchmarks like OCRBenchV2, MMLongBench-Doc, WorldSense, and VoiceBench, the model beats its predecessor, Nemotron Nano V2 VL, and goes toe-to-toe with Alibaba's Qwen3-Omni. On OSWorld, a benchmark for GUI agents, the score jumps from 11.1 points for the previous version to 47.4. Nvidia says throughput at the same interactivity level is up to nine times higher than that of Qwen3-Omni.
How rival models shaped the training data
The benchmarks are one thing, but the release also reveals how the training data was put together, the kind of information you only get with a true open-source release. Nvidia processed roughly 717 billion tokens across seven training stages, with the context window expanding at each step.