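The paper proposes Grouped Query Experts, which builds on grouped-query attention (GQA) by routing each token to only a few query-head experts. At long context, prefill is about 1.7 to 1.8 times faster. A 250M-parameter model trained on 30B tokens reaches an accuracy of 56.04 in its best variant (baseline 55.86) while using only 9 of the 16 query attention computations. This suggests that sparse attention can be achieved within GQA without hurting quality, but it needs a strong learning signal and a shared head that is always on.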
This paper makes long-context attention cheaper and faster by letting each token use only the query heads it needs.
It reached about 1.7 to 1.8 times faster prefill at large context lengths.
Standard attention makes every token run through every attention head, even when some heads are not useful for that token.
The paper's idea, called Grouped Query Experts, keeps the normal key and value cache from grouped-query attention but routes each token to only a few query-head experts.
Grouped Query Experts sits on top of grouped-query attention, the trick many long-context models already use to reduce key-value cache cost.
This is like giving the model many possible attention patterns, while making each token pay for only the small set that seems useful.
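The routing idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the plain top-k rule, the function names `route_query_heads` and `kv_head_for`, the choice of 4 KV heads, the 1 shared + 8 routed split, and the router scores are all assumptions made for the example. Only the 16 query heads and the 9 active computations come from the summary above.

```python
def route_query_heads(router_scores, num_routed, shared_head=0):
    """Return the sorted query-head indices one token attends with.

    The shared head is always active; the remaining slots go to the
    highest-scoring other heads (plain top-k, an assumed routing rule).
    """
    routed = [h for h in range(len(router_scores)) if h != shared_head]
    top = sorted(routed, key=lambda h: router_scores[h], reverse=True)[:num_routed]
    return sorted([shared_head] + top)


def kv_head_for(query_head, num_query_heads, num_kv_heads):
    """Map a query head to the KV head it shares under grouped-query attention."""
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


NUM_QUERY_HEADS = 16  # from the summary: 16 query attention computations
NUM_KV_HEADS = 4      # hypothetical; the source does not state this
NUM_ROUTED = 8        # hypothetical split: 1 shared + 8 routed = 9 active

# Hypothetical router scores for a single token, one per query head.
scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6,
          0.5, 0.05, 0.95, 0.15, 0.85, 0.25, 0.75, 0.35]

active = route_query_heads(scores, NUM_ROUTED)
# The KV cache layout is untouched: each active query head still reads
# the KV head of its own group.
kv_heads_used = sorted({kv_head_for(h, NUM_QUERY_HEADS, NUM_KV_HEADS) for h in active})
fraction = len(active) / NUM_QUERY_HEADS

print("active query heads:", active)
print("KV heads read:", kv_heads_used)
print("fraction of query-head compute:", fraction)
```

With these made-up scores the token runs 9 of 16 query heads, or 0.5625 of the dense query-head attention work, while the key-value cache keeps the same shape it has under ordinary grouped-query attention.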