As we've come to expect from a DeepSeek release, DeepSeek V4 comes with more flashy ML systems optimizations. This time? MegaMoE, a 1400-line fused CUDA kernel that computes the entire MoE forward pass. Let's see how it works (1/4) 🧵