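# SGLang Announces Day-0 Support for NVIDIA Nemotron 3 Super for Building Efficient Multi-Agent Systems

- Source: LMSYS Blog (Chatbot Arena team)
- Published: 2026-03-11 00:00
- AIHOT link: https://aihot.virxact.com/items/cmnxbjke5006dsln0gkq3ge3t
- Original link: https://www.lmsys.org/blog/2026-03-11-run-nvidia-nemotron-3-super

## AI Summary

SGLang supports the open NVIDIA Nemotron 3 Super model on Day 0. The model is a hybrid MoE with 120B total parameters and 12B active parameters, supports a context of up to 1M tokens, and is designed for multi-agent collaboration. Compared with its predecessor, throughput is up to 5x higher and accuracy on the Artificial Analysis Intelligence Index is up to 2x higher. It combines a hybrid Transformer-Mamba architecture with Multi-Token Prediction, runs on GPUs such as B200 and H100, and ships with fully open weights and datasets, making it suited to complex reasoning workloads such as code generation and tool calling.

## Full Text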

We are excited to announce that SGLang supports NVIDIA Nemotron 3 Super on Day 0.

Nemotron 3 Super is a leading open model in the Nemotron 3 family, built for running many collaborating agents together. Agentic systems that chain planning, reasoning, and tools produce far more tokens than single-turn chat; they also need strong reasoning on every step.

Nemotron 3 Super is a 120B-parameter hybrid MoE that activates only 12B parameters per forward pass, giving you leading accuracy for coding, tool calling, and instruction following at a fraction of the cost—plus a 1M-token context so agents keep conversation and plan state in view across long workflows.
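To make the 120B total / 12B active figures concrete, here is a back-of-the-envelope sketch of the weight footprint at the three published precisions (BF16, FP8, NVFP4). The bytes-per-parameter values are nominal assumptions. Real checkpoints carry extra overhead such as quantization scales, and the estimate covers weights only, not KV cache or activations. The helper names are illustrative.

```python
# Back-of-the-envelope weight footprint for Nemotron 3 Super.
TOTAL_PARAMS = 120_000_000_000   # total parameters, as stated in this post
ACTIVE_PARAMS = 12_000_000_000   # parameters activated per forward pass

# Assumption: nominal storage width per parameter. Real checkpoints add
# overhead (e.g. quantization scales) and may keep some layers in higher precision.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}


def weight_gb(precision: str, params: int = TOTAL_PARAMS) -> float:
    """Nominal weight size in GB (1 GB = 1e9 bytes) at the given precision."""
    return params * BYTES_PER_PARAM[precision] / 1e9


def per_gpu_gb(precision: str, tensor_parallel: int) -> float:
    """Weights per GPU, assuming they shard evenly across tensor-parallel ranks."""
    return weight_gb(precision) / tensor_parallel


if __name__ == "__main__":
    print(f"active fraction: {ACTIVE_PARAMS / TOTAL_PARAMS:.0%}")
    for precision in BYTES_PER_PARAM:
        print(
            f"{precision}: {weight_gb(precision):.0f} GB total, "
            f"{per_gpu_gb(precision, 4):.0f} GB per GPU with 4-way tensor parallelism"
        )
```

Under these assumptions the BF16 weights come to roughly 240 GB, or about 60 GB per GPU when sharded four ways. Only about 10% of the parameters participate in any single forward pass, which is where the compute savings of the MoE design come from. All 120B parameters still have to be resident in memory.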

[Figure: Artificial Analysis chart showing Nemotron 3 Super leading on intelligence vs. openness when compared to popular open models of similar size]

As the chart above shows, Nemotron 3 Super leads on the Artificial Analysis Openness Index. Compared with other open models, Nemotron is fully open, with open weights, datasets, and recipes, so developers can customize, optimize, and deploy it on their own infrastructure for maximum privacy and security.

In this post we walk through installing SGLang and serving Nemotron 3 Super for inference.

### About Nemotron 3 Super

- **Architecture**: Mixture of Experts (MoE) with hybrid Transformer-Mamba architecture
  - Highest throughput efficiency in its size category, and up to 5x higher throughput than the previous Nemotron Super model (Llama Nemotron Super 1.5)
  - Multi-Token Prediction (MTP): by predicting several future tokens simultaneously in a single forward pass, MTP drastically accelerates the generation of long-form text
  - Supports Thinking Budget for optimal accuracy with minimum reasoning token generation
- **Accuracy**: Leading accuracy on the Artificial Analysis Intelligence Index in its size category
  - Up to 2x higher accuracy on the Artificial Analysis Intelligence Index compared to the previous Nemotron Super model
  - Latent MoE enables calling 4 experts for the inference cost of only one
- **Model size**: 120B total parameters, 12B active parameters
- **Context length**: up to 1M tokens
- **Model I/O**: Text in, text out
- **Supported GPUs**: B200, H100, H200, DGX Spark, RTX 6000
- **Get started**:
  - Download model weights from Hugging Face: BF16, FP8 and NVFP4
  - Run with SGLang for inference
  - Technical report to build custom, optimized models with Nemotron techniques

### Installation and Quick Start

For an easier setup with SGLang, refer to our getting started cookbook, available here or through NVIDIA Brev launchable.

Run the command below to install dependencies:

We can then serve the model. The command below is configured for a 4xH200 setup; refer to the cookbooks for detailed instructions.

```bash
python3 -m sglang.launch_server \
  --model-path nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
  --host 0.0.0.0 \
  --port 5000 \
  --trust-remote-code \
  --tp 4 \
  --tool-call-parser qwen3_coder \
  --reasoning-parser nemotron_3
```
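Loading a checkpoint of this size can take some time, so a client script may want to wait until the server answers before sending requests. The sketch below polls the OpenAI-compatible `/v1/models` endpoint using only the standard library. The helper names `list_models` and `wait_for_server`, along with the timeout values, are illustrative assumptions and not part of SGLang.

```python
import json
import time
import urllib.request


def list_models(base_url: str, timeout_s: float = 5.0) -> list:
    """Return the model IDs reported by the OpenAI-compatible /models endpoint."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout_s) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


def wait_for_server(base_url: str, deadline_s: float = 600.0,
                    poll_s: float = 5.0, probe=list_models) -> list:
    """Poll `probe` until it succeeds or `deadline_s` elapses.

    Connection failures (URLError, refused connections, socket timeouts) are
    all OSError subclasses, so they are treated as "not ready yet".
    """
    start = time.monotonic()
    while time.monotonic() - start < deadline_s:
        try:
            return probe(base_url)
        except OSError:
            time.sleep(poll_s)
    raise TimeoutError(f"server at {base_url} not ready after {deadline_s} s")
```

With the server launched as above, `wait_for_server("http://localhost:5000/v1")` would return the list of served model IDs once the endpoint responds. The `probe` parameter exists so the loop can be exercised without a live server.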

```python
from openai import OpenAI

# The model name we used when launching the server.
SERVED_MODEL_NAME = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16"

BASE_URL = "http://localhost:5000/v1"
API_KEY = "EMPTY"  # SGLang server doesn't require an API key by default

client = OpenAI(base_url=BASE_URL, api_key=API_KEY)

resp = client.chat.completions.create(
    model=SERVED_MODEL_NAME,
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Give me 3 bullet points about SGLang."},
    ],
    temperature=0.6,
    max_tokens=512,
)
print(
    "Reasoning:", resp.choices[0].message.reasoning_content,
    "\nContent:", resp.choices[0].message.content,
)
```

### Nemotron 3 Super is ideal for multi-agent and reasoning workloads

[Figure: Artificial Analysis chart showing Nemotron 3 Super leading on intelligence vs. efficiency when compared to popular open models of similar size]

As the chart above shows, the model achieves leading accuracy with higher efficiency on Artificial Analysis benchmarks, making it a strong choice for multi-agent systems that need both efficiency and capability.

The 1M-token context is built for long-horizon agent work: agents can keep full conversation history and plan state in context, and RAG pipelines can supply large document sets in one shot. That reduces fragmentation and goal drift in multi-step workflows.
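As a rough illustration of keeping full history in context, the sketch below accumulates chat messages and tracks an estimated token count against the 1M-token window. The `AgentContext` class, the 4-characters-per-token heuristic, and the output reserve are all assumptions made for illustration. For real accounting, use the model's tokenizer or the `usage` field returned by the server.

```python
from dataclasses import dataclass, field

CONTEXT_LIMIT_TOKENS = 1_000_000  # context length stated for Nemotron 3 Super
CHARS_PER_TOKEN = 4               # assumption: crude heuristic, not the real tokenizer


def estimate_tokens(text: str) -> int:
    """Very rough token estimate; swap in the model tokenizer for real use."""
    return max(1, len(text) // CHARS_PER_TOKEN)


@dataclass
class AgentContext:
    """Hypothetical helper that keeps the whole conversation in one message list."""
    messages: list = field(default_factory=list)
    used_tokens: int = 0

    def add(self, role: str, content: str, reserve_for_output: int = 512) -> None:
        """Append a message unless the estimate would overflow the context window."""
        cost = estimate_tokens(content)
        if self.used_tokens + cost + reserve_for_output > CONTEXT_LIMIT_TOKENS:
            raise ValueError("estimated context would exceed the 1M-token window")
        self.messages.append({"role": role, "content": content})
        self.used_tokens += cost

    def remaining(self) -> int:
        """Estimated tokens still available in the window."""
        return CONTEXT_LIMIT_TOKENS - self.used_tokens
```

The `messages` list has the same shape as the `messages` argument of `client.chat.completions.create`, so an agent loop could pass it through unchanged on every turn instead of summarizing or truncating earlier steps.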

Together, this makes Super a strong choice for orchestrating and running many agents on a single node, from code generation and debugging to research summarization, alert triage, and document analysis.

### Get Started

Nemotron 3 Super helps you build scalable, cost-efficient multi-agent AI with high accuracy. With open weights, datasets, and recipes, you get full transparency and the flexibility to fine-tune and deploy on your own infrastructure, from workstation to cloud.

Ready to run multi-agent AI at scale?

- Download Nemotron 3 Super model weights from Hugging Face: BF16, FP8 and NVFP4
- Run with SGLang for inference using the cookbook and through Brev launchable
- Read the Nemotron 3 Super technical report

### Acknowledgement

Thanks to everyone who contributed to bringing Nemotron 3 Super to SGLang.

- **NVIDIA**: Nirmal Kumar Juluru, Anusha Pant, Max Xu, Daniel Afrimi, Shahar Mor, Roi Koren, Ann Guan and many more
- **SGLang team and community**: Baizhou Zhang, Jiajun Li, Ke Bao, Lingyan Hao, Mingyi Lu