# SparDA: An Efficient Sparse Decoupled Attention Architecture for Long-Context LLM Inference

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-03 08:00
- AIHOT score: 48
- AIHOT link: https://aihot.virxact.com/items/cmq9v3a0a0fjsslldb2xtvxmt
- Original link: https://arxiv.org/abs/2606.04511

## AI Summary

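SparDA proposes a decoupled sparse attention architecture that adds a fourth per-layer projection, Forecast, alongside Query, Key, and Value. Forecast predicts the KV blocks the next layer will need, so that CPU-to-GPU prefetch overlaps with execution of the current layer. The GQA implementation uses one Forecast head per group. The design adds <0.5% parameters, and training updates only the Forecast projections. On sparse-pretrained 8B models, SparDA matches or slightly improves accuracy and delivers up to 1.25× prefill speedup and up to 1.7× decode speedup over the sparse-attention offload baseline. Compared with the non-offload sparse baseline, decode throughput on a single GPU is up to 5.3× higher. The code is open source.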

## Main Text

Sparse attention reduces compute and memory bandwidth for long-context LLM inference. However, two key challenges remain: (1) KV cache capacity still grows with sequence length, and offloading to CPU memory introduces a PCIe transfer bottleneck; (2) the sparse selection step itself retains O(T^2) complexity and can dominate attention cost at long contexts. We propose SparDA, a decoupled sparse attention architecture that introduces a fourth per-layer projection, the Forecast, alongside Query, Key, and Value. The Forecast predicts the KV blocks needed by the next layer, enabling lookahead selection that overlaps CPU-to-GPU prefetch with current-layer execution. Because Forecast is decoupled from the attention query, our GQA implementation uses one Forecast head per GQA group, reducing selection overhead versus the original multi-head selector. SparDA adds <0.5% parameters and trains only the Forecast projections by matching the original selector's attention distribution. On two sparse-pretrained 8B models, SparDA matches or slightly improves accuracy and delivers up to 1.25× prefill speedup and 1.7× decode speedup over the sparse-attention offload baseline. By enabling larger feasible batch sizes on a single GPU, SparDA further reaches up to 5.3× higher decode throughput than the non-offload sparse baseline. Our source code is available at https://github.com/NVlabs/SparDA.
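To make the lookahead idea concrete, here is a minimal pure-Python sketch of how a Forecast vector computed at the current layer could pick the KV blocks to prefetch for the next layer. It is not taken from the SparDA code: the function names (`project`, `select_blocks`, `blocks_to_prefetch`), the scoring of blocks by a dot product against per-block summary vectors, and the top-k rule are all assumptions made for illustration. The abstract states only that Forecast predicts the next layer's KV blocks.

```python
import heapq


def project(hidden, weight):
    """Apply a linear projection: hidden (length d) times weight (d rows, k columns)."""
    k = len(weight[0])
    return [sum(hidden[i] * weight[i][j] for i in range(len(hidden))) for j in range(k)]


def select_blocks(forecast, block_summaries, top_k):
    """Score each KV block against the forecast vector and keep the top_k block ids.

    Assumption: each block is represented by one summary vector (for example a
    pooled key); the real selector may score blocks differently.
    """
    scores = [sum(f * s for f, s in zip(forecast, summary)) for summary in block_summaries]
    best = heapq.nlargest(top_k, range(len(scores)), key=lambda i: scores[i])
    return sorted(best)


def blocks_to_prefetch(selected, resident_on_gpu):
    """Return the selected blocks that still live in CPU memory and must be copied."""
    return [b for b in selected if b not in resident_on_gpu]


# Hypothetical toy inputs: a 2-dim hidden state and an identity Forecast projection.
hidden_state = [1.0, 0.0]
w_forecast = [[1.0, 0.0], [0.0, 1.0]]

# Hypothetical summaries for four KV blocks belonging to the NEXT layer.
next_layer_summaries = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5], [0.9, 0.0]]

forecast = project(hidden_state, w_forecast)
selected = select_blocks(forecast, next_layer_summaries, top_k=2)
pending = blocks_to_prefetch(selected, resident_on_gpu={1})
print("selected blocks:", selected)
print("blocks to copy from CPU:", pending)
```

In a real system the copy of the pending blocks would be issued asynchronously while the current layer's attention and MLP run, which is where the overlap described in the abstract would come from. The sketch covers only the selection logic, not the transfer or the training of the Forecast projection.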
