# FlashInfer Open-Sources Nearly 1,400 High-Performance GPU Kernels

- Source: SemiAnalysis (@SemiAnalysis_)
- Published: 2026-04-17 06:46
- AIHOT link: https://aihot.virxact.com/items/cmo22ynym014kslbae43i1bp9
- Original link: https://x.com/SemiAnalysis_/status/2044910423829254615

## AI Summary

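FlashInfer has open-sourced nearly 1,400 high-performance TRT-LLM-Gen GPU kernels optimized for LLM inference. Taking a W4A16 quantized GEMM as the example, the kernel uses INT4 weights with BF16 activations and improves parallel efficiency through a 3-stage pipeline and warp specialization (load, dequantize, MMA, epilogue). Because INT4 dequantization has to run on CUDA cores, which operate on registers rather than TMEM, the MMA uses TS mode, which runs into an SMEM bandwidth bottleneck. Similar to Cursor's approach, the kernel uses pipelining to hide the compute gap between CUDA cores and Tensor Cores and to reduce the throughput loss.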

## Main Text

Curious what's in the PR of almost 1400 kernels?

Here we walk through a simple batched GEMM kernel:
🟠 Tile size: M128, N16, K256
🟠 W4A16: matrix A is INT4 with a BF16 scaling factor for every 32 elements, matrix B is BF16
🟠 3 pipeline stages
🟠 1 CTA MMA
🟠 Static scheduler
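
To make the W4A16 format concrete, here is a minimal pure-Python sketch of group-wise dequantization with one scaling factor per 32 elements. Only the group size of 32 comes from the post. The nibble packing order (low nibble first), the signed two's-complement range of -8 to 7, the absence of a zero point, and the helper names `pack_int4`, `unpack_int4`, `dequantize_row`, and `gemm_row` are all assumptions for illustration, not the kernel's actual layout. Python floats stand in for BF16.

```python
GROUP_SIZE = 32  # from the post: one scaling factor per 32 elements of A


def pack_int4(values):
    """Pack signed 4-bit ints (-8..7) two per byte, low nibble first (assumed layout)."""
    if len(values) % 2:
        raise ValueError("need an even number of INT4 values")
    packed = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed)


def unpack_int4(packed):
    """Inverse of pack_int4: sign-extend each nibble back to a Python int."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out


def dequantize_row(packed, scales, group_size=GROUP_SIZE):
    """Multiply each INT4 value by the scale of its group of `group_size` elements."""
    q = unpack_int4(packed)
    if len(q) != len(scales) * group_size:
        raise ValueError("expected exactly one scale per group")
    return [v * scales[i // group_size] for i, v in enumerate(q)]


def gemm_row(a_row, b_cols):
    """Dot one dequantized row of A against each column of B."""
    return [sum(a * b for a, b in zip(a_row, col)) for col in b_cols]


# One row of A with K = 64, so two scale groups (values are made up).
q_row = [(i % 16) - 8 for i in range(64)]
scales = [0.5, 0.25]
a_row = dequantize_row(pack_int4(q_row), scales)
c_row = gemm_row(a_row, [[1.0] * 64])
```

The point of the sketch is the ordering: A has to be unpacked and scaled before it can be multiplied against B, which is why the kernel needs a separate step between loading A and the MMA.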

This warp-specialized kernel has the following warp roles:
🟠 Load A
🟠 Load A scaling factor (SF)
🟠 Load B
🟠 Cast A: Dequantize INT4 to BF16. Waits on Load A and Load A SF
🟠 MMA: Performs matmul. Waits on Cast A and Load B
🟠 Epilogue: Performs activation computation. Waits on MMA
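
The wait relationships between the warp roles above form a small dependency graph. As a rough sketch, they can be written down and checked with Python's standard `graphlib`. The role identifiers (`load_a`, `cast_a`, and so on) are names chosen here for illustration. The edges are the "waits on" relations listed in the post. This only models ordering for a single tile, not the kernel's actual barrier mechanism.

```python
from graphlib import TopologicalSorter

# Each warp role mapped to the roles it waits on, as listed in the post.
WAITS_ON = {
    "load_a": set(),
    "load_a_sf": set(),
    "load_b": set(),
    "cast_a": {"load_a", "load_a_sf"},
    "mma": {"cast_a", "load_b"},
    "epilogue": {"mma"},
}


def depth(role):
    """Number of dependent steps before `role` can start for one tile."""
    deps = WAITS_ON[role]
    return 0 if not deps else 1 + max(depth(d) for d in deps)


# One valid execution order for a single tile.
order = list(TopologicalSorter(WAITS_ON).static_order())
position = {role: i for i, role in enumerate(order)}
depths = {role: depth(role) for role in WAITS_ON}
```

The three load roles have no dependencies on each other, so they can run concurrently, while Cast A sits on the critical path between Load A and the MMA.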

An interesting detail of this kernel is that its MMA uses TS mode, because dequantizing matrix A requires CUDA cores, which work on registers instead of TMEM.

As shown in our microbenchmarking article, TS mode has slightly lower throughput due to an SMEM bandwidth bottleneck. In addition, @cursor_ai has shown that the CUDA core / Tensor Core compute gap also creates bottlenecks.

To mitigate these issues, the kernel uses pipelining, similar to what Cursor did.
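
To see why pipelining helps, here is a toy timing model in Python. It is a sketch under strong assumptions, not a measurement of the real kernel: the per-tile latencies are made-up numbers in arbitrary time units, the three load roles are folded into one `t_load`, and a buffer slot is assumed to be released only when the MMA for that tile finishes. The function name `simulate` and its parameters are hypothetical.

```python
def simulate(n_tiles, stages, t_load, t_cast, t_mma):
    """Return the finish time of the last MMA for a load -> cast -> mma chain.

    Each role processes tiles in order, one at a time. A load for tile i
    may start only once a buffer slot is free, i.e. once the MMA for
    tile i - stages has finished (assumed release point).
    """
    load_done = [0] * n_tiles
    cast_done = [0] * n_tiles
    mma_done = [0] * n_tiles
    for i in range(n_tiles):
        slot_free = mma_done[i - stages] if i >= stages else 0
        prev_load = load_done[i - 1] if i else 0
        load_done[i] = max(slot_free, prev_load) + t_load

        prev_cast = cast_done[i - 1] if i else 0
        cast_done[i] = max(load_done[i], prev_cast) + t_cast

        prev_mma = mma_done[i - 1] if i else 0
        mma_done[i] = max(cast_done[i], prev_mma) + t_mma
    return mma_done[-1]


# Made-up latencies: the cast (CUDA cores) is slower than the MMA (Tensor Cores).
N_TILES, T_LOAD, T_CAST, T_MMA = 8, 2, 3, 1

serial_time = simulate(N_TILES, 1, T_LOAD, T_CAST, T_MMA)
pipelined_time = simulate(N_TILES, 3, T_LOAD, T_CAST, T_MMA)
```

With a single stage, every tile pays the full load, cast, and MMA latency back to back. With 3 stages, loads and MMAs overlap with the cast, so in this model the total time approaches the cost of the slowest role per tile. The slow step is hidden behind the others but not removed.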

Microbenchmarking article: https://newsletter.semianalysis.com/p/dissecting-nvidia-blackwell-tensor

Cursor blog post: https://cursor.com/blog/kernels

### Quoted Tweet

> Alex Zhurkevich: Trtllmgen kernels are now open. Fastest prefill and decode kernels for our target workloads. We wrote these to win InferenceX, MLPerf, other benchmarks. Powerin...