SemiAnalysis @SemiAnalysis_

2026-04-17 06:46

AI summary

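FlashInfer has open-sourced nearly 1400 high-performance TRT-LLM-Gen GPU kernels optimized for LLM inference. The example here is a W4A16 quantized GEMM: INT4 weights and BF16 activations, with a 3-stage pipeline and warp specialization (load, dequantize, MMA, epilogue) to improve parallel efficiency. Because INT4 dequantization needs CUDA cores, which operate on registers rather than TMEM, the MMA has to use TS mode, which leads to an SMEM bandwidth bottleneck. The kernel borrows from Cursor's design and uses pipelining to hide the CUDA core / Tensor Core compute gap and reduce the throughput loss.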

Curious what's in the PR of almost 1400 kernels?

Here we walk through a simple batched GEMM kernel:
🟠 Tile size: M128, N16, K256
🟠 W4A16: matrix A is INT4 with a BF16 scaling factor for every 32 elements, matrix B is BF16
🟠 3 pipeline stages
🟠 1 CTA MMA
🟠 Static scheduler
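For a rough sense of what this tile configuration costs in SMEM, here is a back-of-the-envelope sketch. It assumes INT4 values are packed two per byte, that the 32-element scaling groups run along K, and that no padding or alignment is added; none of these details are stated in the post, and the kernel's real layout may differ.

```python
# Back-of-the-envelope SMEM footprint for the tile configuration above.
# Assumptions (not from the post): INT4 values are packed two per byte, the
# scaling-factor groups run along K, and no padding or alignment is added.

TILE_M, TILE_N, TILE_K = 128, 16, 256
SF_GROUP = 32          # one BF16 scaling factor per 32 elements of A
NUM_STAGES = 3
BF16_BYTES = 2

def stage_bytes() -> dict:
    a_bytes = TILE_M * TILE_K // 2                           # INT4: 4 bits each
    a_sf_bytes = TILE_M * (TILE_K // SF_GROUP) * BF16_BYTES  # scaling factors
    b_bytes = TILE_N * TILE_K * BF16_BYTES                   # matrix B in BF16
    return {"A": a_bytes, "A_SF": a_sf_bytes, "B": b_bytes}

per_stage = stage_bytes()
total = NUM_STAGES * sum(per_stage.values())
print(per_stage, total)
```

Under these assumptions one stage holds 16 KiB of A, 2 KiB of scaling factors, and 8 KiB of B, so three stages come to 78 KiB of input buffers.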

This warp-specialized kernel has the following warp roles:
🟠 Load A
🟠 Load A scaling factor (SF)
🟠 Load B
🟠 Cast A: Dequantizes INT4 to BF16. Waits on Load A and Load A SF
🟠 MMA: Performs the matmul. Waits on Cast A and Load B
🟠 Epilogue: Performs the activation computation. Waits on MMA
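The "waits on" relationships between the warp roles form a small dependency graph, which can be written down and checked with Python's standard `graphlib`. The role names below are illustrative labels, not identifiers from the kernel. In the real kernel the roles run concurrently and synchronize per pipeline stage, so this only shows the ordering constraint for a single tile.

```python
from graphlib import TopologicalSorter

# Each warp role maps to the roles it waits on, as listed in the post.
# The role names are illustrative labels, not identifiers from the kernel.
WAITS_ON = {
    "load_a": set(),
    "load_a_sf": set(),
    "load_b": set(),
    "cast_a": {"load_a", "load_a_sf"},
    "mma": {"cast_a", "load_b"},
    "epilogue": {"mma"},
}

def one_valid_order(deps):
    """Return one order in which every role runs after the roles it waits on."""
    return list(TopologicalSorter(deps).static_order())

order = one_valid_order(WAITS_ON)
print(order)
```

The three load roles have no dependencies on each other, which is what lets them run in parallel ahead of Cast A and MMA.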

An interesting thing about this kernel is that its MMA uses TS mode, because dequantizing matrix A requires CUDA cores, which work on registers instead of TMEM.
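To make the Cast A step concrete, here is a plain-Python reference for dequantizing one 32-element scaling group. It assumes signed 4-bit two's-complement values packed two per byte with the low nibble first, and no zero point; the post does not specify the encoding, so treat these as placeholders. It also uses ordinary Python floats, so BF16 rounding is not modeled.

```python
# Reference dequantization of one scaling group, in plain Python.
# Assumptions (not from the post): values are signed 4-bit two's complement,
# two per byte with the low nibble first, and there is no zero point.

SF_GROUP = 32  # elements sharing one scaling factor

def unpack_int4(packed: bytes) -> list[int]:
    out = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            # Map 0..15 to the signed range -8..7.
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out

def dequantize_group(packed: bytes, scale: float) -> list[float]:
    values = unpack_int4(packed)
    if len(values) != SF_GROUP:
        raise ValueError("expected exactly one scaling group")
    return [v * scale for v in values]

group = bytes([0x21, 0xF8] + [0x00] * 14)   # 16 bytes = 32 INT4 values
print(dequantize_group(group, 0.5)[:4])     # [0.5, 1.0, -4.0, -0.5]
```

This per-element unpack-and-scale is the kind of work that runs on CUDA cores in registers, which is why the dequantized A operand cannot be fed to the MMA straight from the loaded tile.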

As shown in our microbenchmarking article, TS mode has slightly lower throughput due to an SMEM bandwidth bottleneck. In addition, @cursor_ai also showed that the CUDA core / Tensor Core compute gap creates bottlenecks.

To mitigate these issues, the kernel uses pipelining, similar to what Cursor did.
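A toy timing model shows why pipelining helps when the dequantization (CUDA core) step and the MMA (Tensor Core) step have different costs. The costs below are arbitrary units chosen for illustration, not measurements, and the model is a simplification of my own rather than the kernel's actual scheduling: one cast at a time, one MMA at a time, and a cast may only start once a buffer slot has been freed by the MMA `num_stages` tiles earlier.

```python
# Toy model of a multi-stage cast -> MMA pipeline over K tiles.
# Costs are arbitrary units for illustration, not measured numbers.

def serial_time(num_tiles: int, cast_cost: int, mma_cost: int) -> int:
    # No overlap: every tile is cast, then multiplied, before the next starts.
    return num_tiles * (cast_cost + mma_cost)

def pipelined_time(num_tiles: int, cast_cost: int, mma_cost: int,
                   num_stages: int) -> int:
    cast_done, mma_done = [], []
    for i in range(num_tiles):
        # The cast warp is free once it finished the previous tile...
        start = cast_done[i - 1] if i > 0 else 0
        # ...and needs a buffer slot released by the MMA num_stages tiles ago.
        if i >= num_stages:
            start = max(start, mma_done[i - num_stages])
        cast_done.append(start + cast_cost)
        # The MMA waits for this tile's cast and for the previous MMA.
        prev_mma = mma_done[i - 1] if i > 0 else 0
        mma_done.append(max(cast_done[i], prev_mma) + mma_cost)
    return mma_done[-1]

print(serial_time(8, 3, 2))         # 40
print(pipelined_time(8, 3, 2, 3))   # 26
```

In this model the pipelined run is bounded by the slower step (the cast) plus one MMA at the end, so the faster Tensor Core work is hidden behind the CUDA core work instead of being added to it. With a single stage the model collapses back to the serial time.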

Microbenchmarking article: https://newsletter.semianalysis.com/p/dissecting-nvidia-blackwell-tensor

Cursor blog post: https://cursor.com/blog/kernels

Alex Zhurkevich: Trtllmgen kernels are now open. Fastest prefill and decode kernels for our target workloads. We wrote these to win InferenceX, MLPerf, other benchmarks. Powerin...
