SpaceX has almost finished V1.0 of an in-house AI training stack written in C. It is mapped exactly to a cluster of 220k GB300s with 800G NICs, makes heavy use of pipeline parallelism, and runs as close to bare metal as possible.
The potential speedup over JAX for large training runs is more than an order of magnitude.
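
For readers unfamiliar with pipeline parallelism, here is a minimal C sketch of the textbook GPipe-style forward schedule. It is a generic illustration, not the stack described above: the function names `bubble_fraction` and `print_forward_schedule` are hypothetical, and it assumes every stage takes one equal time slot per microbatch. With p stages and m microbatches, each stage sits idle for p - 1 of the m + p - 1 slots, which is why pipeline-heavy designs usually want many microbatches per step.

```c
#include <stdio.h>

/* Idle ("bubble") fraction of a GPipe-style forward pipeline.
 * Assumption: every stage takes one equal time slot per microbatch.
 * With p stages and m microbatches the pass spans m + p - 1 slots,
 * and each stage is idle for p - 1 of them.
 * Returns -1.0 for invalid input. */
double bubble_fraction(int stages, int microbatches)
{
    if (stages < 1 || microbatches < 1)
        return -1.0;
    return (double)(stages - 1) / (double)(microbatches + stages - 1);
}

/* Print which microbatch each stage processes in each time slot.
 * A '.' marks a slot in which that stage is idle. */
void print_forward_schedule(int stages, int microbatches)
{
    int slots = microbatches + stages - 1;
    for (int s = 0; s < stages; s++) {
        printf("stage %d:", s);
        for (int t = 0; t < slots; t++) {
            int mb = t - s; /* stage s receives microbatch mb at slot mb + s */
            if (mb >= 0 && mb < microbatches)
                printf(" %2d", mb);
            else
                printf("  .");
        }
        printf("\n");
    }
}
```

Calling `print_forward_schedule(4, 8)` from a `main` prints the staircase pattern in which each stage starts one slot after the previous one. Under the same assumptions, `bubble_fraction(4, 13)` gives 3/16, so raising the microbatch count shrinks the idle share.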