Hao AI Lab (@haoailab)

2026-04-10 04:46

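AI summary

FP4 hardware is now widely available, but 4-bit attention has long suffered from a quality bottleneck that blocks end-to-end FP4 deployment. The research team proposes Attn-QAT, the first systematic study of quantization-aware training for the attention mechanism. The method brings FP4 attention quality to a level comparable with BF16, while delivering 1.1x-1.5x higher throughput than SageAttention3 on an RTX 5090 and a 1.39x speedup over FlashAttention-4 on a B200.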

(1/5) FP4 hardware is here, but 4-bit attention still kills model quality, blocking true end-to-end FP4 serving. To fix that, we propose Attn-QAT, the first systematic study of quantization-aware training for attention.

The result: FP4 attention quality is comparable to BF16 attention, with 1.1x-1.5x higher throughput than SageAttention3 on an RTX 5090 and a 1.39x speedup over FlashAttention-4 on a B200.

Blog: https://haoailab.com/blogs/attn-qat/
Code: https://github.com/hao-ai-lab/FastVideo/pull/1225
Checkpoints: https://huggingface.co/FastVideo/14B_qat_400

Tags: Data/Training, Papers/Research, Deployment/Engineering
