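Multipath Reliable Connection (MRC) is a new RDMA transport protocol introduced jointly by NVIDIA, Microsoft, and OpenAI, in collaboration with AMD, Broadcom, and Intel. The protocol was first validated and optimized on NVIDIA Spectrum-X Ethernet hardware. MRC's core innovation is a change in how connections work: a single RDMA data flow can use multiple network paths to carry AI training traffic, rather than forcing each GPU connection onto a single fixed route. RDMA lets GPUs move data with very little CPU help, which is critical when thousands of GPUs continuously exchange model updates during training. When the network experiences congestion, a link failure, or an overloaded switch, traffic can automatically route around the problem without a software-level fix, so one bad path does not slow down the entire compute cluster and large-scale AI training jobs keep running efficiently.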
MRC was introduced by NVIDIA, Microsoft, and OpenAI, in collaboration with AMD, Broadcom, and Intel.
Multipath Reliable Connection (MRC) is a new RDMA transport protocol, first proven and optimized on NVIDIA Spectrum-X Ethernet hardware.
It spreads AI training traffic across many network paths instead of forcing each GPU connection through a single route.
Basically, it is a new way to move training data between huge numbers of GPUs without letting one bad network path slow down the whole cluster.
RDMA lets GPUs move data through the network with very little CPU help, which is crucial when thousands of GPUs must exchange model updates constantly during one training run.
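To make the "many paths instead of one route" idea concrete, here is a toy model in Python. It is only a sketch under stated assumptions, not MRC's actual algorithm or API: the function names, the per-packet cost numbers, and the round-robin policy are all hypothetical, and real hardware makes these decisions per packet in the NIC and switches. The model compares a flow pinned to one path with the same flow sprayed over every path that is not marked unhealthy.

```python
from itertools import cycle


def pin_to_one_path(num_packets, cost_per_packet, path):
    """Fixed-route model: every packet of the flow uses the same path."""
    return num_packets * cost_per_packet[path]


def spray_over_paths(num_packets, cost_per_packet, unhealthy=()):
    """Multipath model (hypothetical policy): round-robin over healthy paths.

    Paths carry packets in parallel, so the flow finishes when the most
    heavily loaded healthy path finishes.
    """
    healthy = [p for p in range(len(cost_per_packet)) if p not in unhealthy]
    if not healthy:
        raise ValueError("no healthy path available")
    load = {p: 0 for p in healthy}
    # Hand out packets one at a time, cycling through the healthy paths.
    for _, p in zip(range(num_packets), cycle(healthy)):
        load[p] += 1
    return max(load[p] * cost_per_packet[p] for p in healthy)


# Assumed example: four paths, path 2 is congested (10x cost per packet).
costs = [1, 1, 10, 1]
pinned = pin_to_one_path(1200, costs, path=2)
sprayed = spray_over_paths(1200, costs, unhealthy={2})
print(pinned, sprayed)
```

In this toy setup the pinned flow pays the congested path's cost for every packet, while the sprayed flow splits the same packets evenly over the three healthy paths. The numbers are illustrative only and say nothing about real MRC performance.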