# Attn-QAT Delivers FP4 Attention Quantization with BF16-Comparable Quality and Up to 1.5x Speedup

- Source: Hao AI Lab (@haoailab)
- Published: 2026-04-10 04:46
- AIHOT link: https://aihot.virxact.com/items/cmnxjn85o00glsl9o06rccz9l
- Original link: https://x.com/haoailab/status/2042343429108351116

## AI Summary

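Although FP4 hardware is now widely available, 4-bit attention has long been a quality bottleneck that blocks end-to-end FP4 deployment. The research team proposes Attn-QAT, the first systematic study of quantization-aware training for the attention mechanism. With this method, FP4 attention quality is comparable to BF16, while throughput is 1.1-1.5x higher than SageAttention3 on an RTX 5090 and 1.39x faster than FlashAttention-4 on a B200.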

## Full Text

(1/5) FP4 hardware is here, but 4-bit attention still kills model quality, blocking true end-to-end FP4 serving.
To fix that, we propose Attn-QAT, the first systematic study of quantization-aware training for attention.

The result: FP4 attention quality is comparable to BF16 attention, with 1.1x-1.5x higher throughput than SageAttention3 on an RTX 5090 and a 1.39x speedup over FlashAttention-4 on a B200.
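For readers unfamiliar with the idea, here is a minimal pure-Python sketch of what "fake quantization" of attention inputs to FP4 can look like. It is not the Attn-QAT recipe, which the post does not describe. It only rounds Q, K, and V to the FP4 E2M1 value grid (quantize, then dequantize) before ordinary softmax attention. The function names and the single per-row scale are illustrative assumptions. Quantization-aware training in general runs this kind of rounding in the forward pass so the model learns to tolerate it.

```python
import math

# Magnitudes representable in FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
FP4_E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def fake_quant_fp4(row):
    """Quantize-dequantize one row of floats onto the FP4 grid.

    Assumption: one scale per row, chosen so the largest magnitude maps
    to 6.0. This is a simplification made for the sketch.
    """
    amax = max(abs(x) for x in row)
    if amax == 0.0:
        return list(row)
    scale = amax / FP4_E2M1_GRID[-1]
    out = []
    for x in row:
        # Pick the nearest representable magnitude, then restore sign and scale.
        mag = min(FP4_E2M1_GRID, key=lambda g: abs(g - abs(x) / scale))
        out.append(math.copysign(mag * scale, x))
    return out


def attention(q, k, v, quantize=False):
    """Single-head softmax attention over lists of row vectors."""
    if quantize:
        q = [fake_quant_fp4(r) for r in q]
        k = [fake_quant_fp4(r) for r in k]
        v = [fake_quant_fp4(r) for r in v]
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        probs = [w / total for w in weights]
        out.append([sum(p * vj[c] for p, vj in zip(probs, v))
                    for c in range(len(v[0]))])
    return out
```

Calling `attention(q, k, v)` and `attention(q, k, v, quantize=True)` on the same inputs and comparing the outputs gives a rough feel for the rounding error that 4-bit attention introduces.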

Blog: https://haoailab.com/blogs/attn-qat/
Code: https://github.com/hao-ai-lab/FastVideo/pull/1225
Checkpoints: https://huggingface.co/FastVideo/14B_qat_400
