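Summary: FP4 hardware is now widespread, but 4-bit attention has long suffered from a quality bottleneck that blocks end-to-end FP4 deployment. The research team proposes Attn-QAT, the first systematic study of quantization-aware training for the attention mechanism. The method brings FP4 attention quality up to BF16 level, while delivering 1.1-1.5x higher throughput than SageAttention3 on an RTX 5090 and a 1.39x speedup over FlashAttention-4 on a B200.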
(1/5) FP4 hardware is here, but 4-bit attention still kills model quality, blocking true end-to-end FP4 serving. To fix that, we propose Attn-QAT, the first systematic study of quantization-aware training for attention.
The result: FP4 attention quality is comparable to BF16 attention, with 1.1x-1.5x higher throughput than SageAttention3 on an RTX 5090 and a 1.39x speedup over FlashAttention-4 on a B200.
Blog: https://haoailab.com/blogs/attn-qat/
Code: https://github.com/hao-ai-lab/FastVideo/pull/1225
Checkpoints: https://huggingface.co/FastVideo/14B_qat_400
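For readers new to quantization-aware training, here is a rough sketch of the "fake quantization" step that QAT schemes generally insert into the forward pass: values are snapped to the 4-bit grid and immediately dequantized, so the model trains against the rounding error it will see at inference. This is not the Attn-QAT implementation. The function name `fake_quant_fp4`, the one-scale-per-block scheme, and the tie-breaking rule are assumptions made for illustration; only the E2M1 value grid is standard.

```python
# Illustrative sketch only, not the Attn-QAT code.
# Magnitudes representable in FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
FP4_E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def fake_quant_fp4(block):
    """Quantize one block of floats to the FP4 grid, then dequantize.

    Assumption: a single float scale per block, chosen so the largest
    magnitude in the block maps to 6.0, the largest FP4 value.
    """
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)  # all-zero block: nothing to quantize
    scale = amax / FP4_E2M1_GRID[-1]
    out = []
    for v in block:
        mag = abs(v) / scale
        # Nearest grid point; ties go to the smaller magnitude here,
        # which may differ from the rounding mode real hardware uses.
        nearest = min(FP4_E2M1_GRID, key=lambda g: abs(g - mag))
        out.append((nearest if v >= 0 else -nearest) * scale)
    return out


block = [0.1, -0.7, 1.2, 3.0]
quantized = fake_quant_fp4(block)
print(quantized)  # [0.0, -0.75, 1.0, 3.0]
```

In the example, 0.1 collapses to 0.0 and 1.2 drops to 1.0, which gives a feel for how coarse a 4-bit grid is. Rounding has zero gradient almost everywhere, so QAT setups commonly pass gradients through it unchanged (a straight-through estimator); whether Attn-QAT does so is not stated in this post.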