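Shared memory on NVIDIA GPUs has grown generation over generation while the register file has stayed the same size, mainly because each doubling of Tensor Core throughput needs a larger buffer pool. Global memory loads are far slower than Tensor Core processing and their latency keeps rising, so NVIDIA uses shared memory as the staging area for Tensor Cores. Blackwell did not raise per-SM shared memory capacity, but with the two-SM cooperative design of tcgen05 MMA each SM only needs to load half of the operands, which effectively doubles capacity.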
NVIDIA ARCHITECTURE ALERT🚨
Shared memory capacity has grown almost every generation, while register file size has stayed constant. The reason is that every increase in Tensor Core throughput requires a deeper staging buffer. Tensor Cores consume data much faster than global memory can deliver it, so we buffer data in a staging memory and let memory loads run ahead of the MMA operations. Tensor Core throughput doubled every generation, but global memory load latency didn't decrease and in fact increased. The staging memory therefore has to grow to hold more data in flight. NVIDIA chose shared memory as the staging memory for Tensor Cores, which explains why shared memory grew while the register file did not. Blackwell's shared memory size, however, didn't increase from Hopper. This is because a tcgen05 MMA can span 2 SMs, so each SM's shared memory only needs to hold half of the operands. In effect, Blackwell's shared memory capacity still doubled.
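The sizing argument can be made concrete with a back-of-the-envelope calculation. The sketch below assumes a simple model that the post does not state explicitly: the staging buffer must hold roughly (consumption rate) x (load latency) bytes in flight, as in Little's law. The rate and latency values are hypothetical placeholders, not measured figures for any GPU, and the function names are illustrative only.

```python
def staging_bytes(consume_bytes_per_cycle: float, load_latency_cycles: float) -> float:
    """Bytes that must be in flight so loads stay ahead of the MMA units.

    Assumed model (not from the post): buffer = rate * latency.
    """
    return consume_bytes_per_cycle * load_latency_cycles


def per_sm_staging_bytes(total_bytes: float, cooperating_sms: int = 1) -> float:
    """Split the staged operands evenly across the SMs that share one MMA."""
    return total_bytes / cooperating_sms


# Hypothetical numbers, chosen only to show the scaling.
base_rate = 256.0   # bytes per cycle consumed by the Tensor Cores
latency = 800.0     # global memory load latency in cycles

gen0 = staging_bytes(base_rate, latency)        # baseline generation
gen1 = staging_bytes(2 * base_rate, latency)    # throughput doubled, latency unchanged
paired = per_sm_staging_bytes(gen1, cooperating_sms=2)  # one MMA spread over 2 SMs

print(f"baseline buffer:        {gen0:.0f} bytes")
print(f"after 2x throughput:    {gen1:.0f} bytes")
print(f"per SM with 2-SM MMA:   {paired:.0f} bytes")
```

Under this model, doubling throughput at constant latency doubles the required buffer, and splitting the operands across two SMs brings the per-SM requirement back to the baseline. That is the sense in which an unchanged per-SM shared memory size behaves like a doubled one.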