GPUs are leaving performance on the table.
Closing the gap between theoretical peak and real-world throughput is nearly impossible when hand-tuning CUDA kernels at scale.
So why are hand-written CUDA kernels losing to auto-generated ones?
Mohamed Abdelfattah at Makora has a solution: https://youtu.be/ukzACWrk0W0?si=whrH_WsHltmF_J7B