# OpenAI 联合 AMD、Broadcom、Intel、Microsoft 和 NVIDIA 构建网络协议以解决 AI 超级计算机瓶颈

- 来源：The Decoder：AI News（RSS）
- 作者：Matthias Bastian
- 发布时间：2026-05-07 03:13
- AIHOT 分数：58
- AIHOT 链接：https://aihot.virxact.com/items/cmougjwl600a8sl0pspjwrdbw
- 原文链接：https://the-decoder.com/openai-built-a-networking-protocol-with-amd-broadcom-intel-microsoft-and-nvidia-to-fix-ai-supercomputer-bottlenecks

## AI 摘要

OpenAI 与 AMD、Broadcom、英特尔、微软和英伟达共同开发了开源网络协议 MRC。该协议能在 GPU 间通过数百条路径同时传输数据，仅需两层交换机即可连接超过 10 万个 GPU，相比传统方案减少了交换机层级，从而降低了功耗与成本。MRC 协议目前已应用于 OpenAI 的 Stargate 超级计算机上运行。

## 正文

OpenAI built a networking protocol with AMD, Broadcom, Intel, Microsoft, and NVIDIA to fix AI supercomputer bottlenecks

OpenAI has teamed up with AMD, Broadcom, Intel, Microsoft, and NVIDIA to develop a new network protocol called MRC, short for Multipath Reliable Connection.

MRC is designed to make data transfers between GPUs in large AI supercomputers faster, more predictable, and more resilient, a key requirement for training large AI models. Instead of sending each transfer over a single network path, MRC spreads packets across hundreds of paths simultaneously, reducing congestion in the core of the network.

When network paths, links, or switches fail, MRC can detect the problem and route around it on a microsecond timescale. Conventional network fabrics can take seconds or even tens of seconds to stabilize after failures, according to OpenAI.

This helps training runs continue through network failures and maintenance events that would previously have disrupted or stalled them. OpenAI says MRC’s multi-plane network design can connect more than 100,000 GPUs using only two tiers of Ethernet switches, instead of the three or four tiers required by conventional 800 Gb/s networks. That reduces power consumption, component count, and overall network cost.

MRC already runs on OpenAI's largest supercomputers

MRC is already deployed across all of OpenAI's largest NVIDIA GB200 supercomputers used to train frontier models, including its Oracle Cloud Infrastructure site in Abilene, Texas, and Microsoft's Fairwater supercomputers.

During the training of a recent frontier model for ChatGPT and Codex, OpenAI says it had to reboot four tier-1 switches. With MRC, the company did not need to coordinate the reboot with the teams running training jobs in the cluster.

The MRC specification was published today through the Open Compute Project (OCP), along with an accompanying research paper. Besides OpenAI, AMD, Broadcom, Intel, Microsoft, and Nvidia contributed to the development.

AI News Without the Hype – Curated by Humans
