SemiAnalysis (@SemiAnalysis_)

2026-05-28 06:39

GPUs are leaving performance on the table.

Closing the gap between theoretical peak and real-world throughput is nearly impossible when hand-tuning CUDA kernels at scale.

So why are hand-written CUDA kernels losing to auto-generated ones?

Mohamed Abdelfattah at Makora has a solution: https://youtu.be/ukzACWrk0W0?si=whrH_WsHltmF_J7B
