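NVIDIA has released Nemotron 3 Nano Omni, positioned not as a general-purpose chatbot but as a lightweight perception module for agentic systems. The model uses a 30B-A3B mixture-of-experts architecture and, on vision, audio, and text inputs, delivers up to 9x higher throughput than comparable open omni models. It is meant to serve as the "eyes and ears" of a multi-agent stack: it perceives screens, documents, and audio, then passes structured context to reasoning layers such as Nemotron Super (execution) and Ultra (planning), which suits large-scale agent workflows that call the model at high frequency. The model is fully open and is now available on Hugging Face.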
NVIDIA just launched Nemotron 3 Nano Omni. Not the first omni model, but one built for a different job. That's what makes it interesting:
Models like ChatGPT and Gemini already handle vision, audio, and text. What they're not optimized for is running as a lightweight perception sub-agent inside agentic systems, where the model gets called hundreds of times in a loop.
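That's the gap Nemotron 3 Nano Omni fills. It uses a 30B-A3B mixture-of-experts architecture (by the usual naming convention, roughly 30B total parameters with about 3B active per token) and delivers up to 9x higher throughput than comparable open omni models. Not smarter, but faster and cheaper at scale.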
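To make the "faster and cheaper" point concrete, here is a back-of-the-envelope sketch of why a sparse mixture-of-experts model does less work per token than a dense model of the same total size. It assumes the "30B-A3B" label means about 30B total and 3B active parameters (the common reading, not confirmed by the post) and uses the rough rule of thumb of about 2 FLOPs per active parameter per token for a forward pass. The function names are illustrative, and this arithmetic is not the source of the 9x throughput figure, which is NVIDIA's comparison against other omni models.

```python
# Rough per-token compute comparison: sparse MoE vs. a dense model of equal size.
# Assumption: "30B-A3B" = ~30B total parameters, ~3B active per token.
# Assumption: forward-pass cost is roughly 2 FLOPs per active parameter per token.

def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Share of the model's parameters that participate in each token."""
    return active_params_b / total_params_b


def approx_forward_flops_per_token(active_params_b: float) -> float:
    """Rule-of-thumb forward FLOPs per token, given active parameters in billions."""
    return 2.0 * active_params_b * 1e9


frac = active_fraction(30.0, 3.0)
moe_flops = approx_forward_flops_per_token(3.0)     # only the active experts run
dense_flops = approx_forward_flops_per_token(30.0)  # hypothetical dense 30B model
ratio = dense_flops / moe_flops

print(f"active fraction: {frac:.0%}")
print(f"dense / MoE per-token compute: {ratio:.1f}x")
```

Real throughput also depends on memory bandwidth, batching, and how each modality is encoded, so treat this ratio as intuition rather than a benchmark.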
The design logic: it acts as the "eyes and ears" in a multi-agent stack, paired with Nemotron Super for execution and Ultra for planning. One model sees screens, reads documents, and hears audio, then hands structured context to the reasoning layer.
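The hand-off described above can be sketched as a small pipeline: a perception step turns raw observations into structured context, and separate planning and execution steps consume it. Everything here is hypothetical and for illustration only. `Observation`, `StructuredContext`, `perceive`, `plan`, `execute`, and `run_loop` are made-up names, and the stub bodies stand in for calls to the perception model and to the planning and execution models. No real Nemotron API is used.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Observation:
    modality: str  # e.g. "screen", "document", "audio"
    payload: str   # stand-in for raw pixels, pages, or audio samples


@dataclass
class StructuredContext:
    modality: str
    summary: str
    entities: List[str] = field(default_factory=list)


def perceive(obs: Observation) -> StructuredContext:
    """Hypothetical perception step (where the omni model would be called).

    The stub just treats capitalized words as "entities".
    """
    words = obs.payload.split()
    entities = [w.strip(".,") for w in words if w[:1].isupper()]
    return StructuredContext(obs.modality, obs.payload[:80], entities)


def plan(contexts: List[StructuredContext]) -> List[str]:
    """Hypothetical planning step: one step per perceived context."""
    return [
        f"handle {c.modality}: {', '.join(c.entities) or 'no entities'}"
        for c in contexts
    ]


def execute(steps: List[str]) -> List[str]:
    """Hypothetical execution step: pretend to carry out each planned step."""
    return [f"done: {s}" for s in steps]


def run_loop(observations: List[Observation]) -> List[str]:
    # Perception runs once per observation, so it is the hot path in the loop.
    contexts = [perceive(o) for o in observations]
    return execute(plan(contexts))


results = run_loop([
    Observation("screen", "Login button visible"),
    Observation("audio", "user asks about Invoice 42"),
])
for line in results:
    print(line)
```

The point of the split is that the perception step runs on every observation, often many times per task, so it benefits most from a small, fast model, while the heavier reasoning models only see compact structured context.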