Rohan Paul@rohanpaul_ai

2026-06-28 07:13 · 5 days ago

AI summary

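The paper proposes Grouped Query Experts, which builds on grouped-query attention (GQA) by routing each token to only a few query-head experts. At long context lengths, prefill is about 1.7 to 1.8 times faster. 250M-parameter models were trained on 30B tokens; the best version reaches an accuracy of 56.04 (baseline 55.86) while using only 9 of 16 query-attention computations. This shows that sparse attention is achievable inside GQA without hurting quality, but it needs a strong learning signal and one shared head that is always on.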

This paper makes long-context attention cheaper and faster by letting each token use only the query heads it needs.

It reaches about 1.7 to 1.8 times faster prefill at large context lengths.

Standard attention makes every token run through every attention head, even when some heads are not useful for that token.

The paper's idea, called Grouped Query Experts, keeps the normal key and value cache from grouped-query attention but routes each token to only a few query-head experts.

Grouped Query Experts sits on top of grouped-query attention, the trick many long-context models already use to reduce key-value cache cost.

This is like giving the model many possible attention patterns, while making each token pay for only the small set that seems useful.
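To make the idea concrete, here is a minimal pure-Python sketch of causal grouped-query attention in which each token computes only the query heads marked active for it. It is not the paper's implementation: the function name `routed_gqa_attention`, the `active` mask layout, and the toy dimensions are all assumptions for illustration. The key and value tensors keep the usual GQA layout, and only the query side is sparse.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def routed_gqa_attention(q, k, v, active, group_size):
    """Causal GQA where each token evaluates only its active query heads.

    q:      [num_q_heads][seq][dim] query vectors
    k, v:   [num_kv_heads][seq][dim] shared key/value cache (plain GQA layout)
    active: [seq][num_q_heads] booleans (hypothetical router decision)
    group_size: number of query heads sharing one key/value head
    Returns (out, computed). out[h][t] stays all zeros when head h is
    skipped for token t; computed counts evaluated head-token pairs.
    """
    num_q_heads, seq, dim = len(q), len(q[0]), len(q[0][0])
    out = [[[0.0] * dim for _ in range(seq)] for _ in range(num_q_heads)]
    computed = 0
    for h in range(num_q_heads):
        kv = h // group_size  # GQA: several query heads read the same KV head
        for t in range(seq):
            if not active[t][h]:
                continue  # skipped: no scores, no softmax, no value mixing
            scores = [dot(q[h][t], k[kv][s]) / math.sqrt(dim)
                      for s in range(t + 1)]  # causal: keys 0..t only
            w = softmax(scores)
            out[h][t] = [sum(w[s] * v[kv][s][d] for s in range(t + 1))
                         for d in range(dim)]
            computed += 1
    return out, computed

# Toy example (assumed sizes): 4 query heads, 2 KV heads, 3 tokens, dim 2.
seq = 3
q = [[[1.0, 0.0] for _ in range(seq)] for _ in range(4)]
k = [[[1.0, 0.0] for _ in range(seq)] for _ in range(2)]
v = [[[float(s), 1.0] for s in range(seq)] for _ in range(2)]
active = [[True, False, True, False] for _ in range(seq)]  # 2 of 4 heads per token
out, computed = routed_gqa_attention(q, k, v, active, group_size=2)
print(computed, "of", 4 * seq, "head-token pairs computed")
```

In a real kernel the skipped heads would be dropped from the batched matrix multiply rather than looped over, which is where any wall-clock saving would come from.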

The authors trained 250M-parameter models on 30B tokens and compared the method with a normal grouped-query attention baseline.

The best version matched the baseline's average accuracy, 56.04 versus 55.86, while using 9 of 16 query-attention computations.
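A quick back-of-the-envelope check on those numbers. It assumes, purely for illustration, that attention cost scales linearly with the number of active query heads and that nothing else contributes to prefill time. The post does not say the paper derives its speedup this way.

```python
from fractions import Fraction

total_heads = 16   # query-attention computations in the dense baseline (from the post)
active_heads = 9   # computations actually performed per token (from the post)

fraction_computed = Fraction(active_heads, total_heads)
# Assumption: cost is proportional to active heads only.
ideal_speedup = 1 / fraction_computed

print(float(fraction_computed))        # 0.5625
print(round(float(ideal_speedup), 2))  # 1.78
```

Under that simplification the ceiling is 16/9, roughly 1.78 times, which sits in the same range as the reported 1.7 to 1.8 times prefill gain at long context, where attention dominates the cost.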

This shows that attention can be made sparse inside grouped-query attention without hurting quality, but only when the router gets a strong learning signal and one shared head stays always on.
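One way to picture the "shared head stays always on" constraint is a per-token head selection that keeps one fixed head and picks the rest by router score. The post does not describe how the paper's router scores heads or how it is trained, so the top-k rule, the function name `select_heads`, and the example scores below are assumptions.

```python
def select_heads(router_scores, k, shared_head=0):
    """Pick the active query heads for one token.

    router_scores: one score per query head (hypothetical router output)
    k:             number of routed heads kept besides the shared head
    shared_head:   index of the head that is always on
    Returns a list of booleans, one per head.
    """
    routed = [h for h in range(len(router_scores)) if h != shared_head]
    # Assumed rule: keep the k highest-scoring routed heads.
    top = sorted(routed, key=lambda h: router_scores[h], reverse=True)[:k]
    chosen = set(top) | {shared_head}
    return [h in chosen for h in range(len(router_scores))]

scores = [-5.0, 0.3, 2.1, -0.7]  # head 0 scores lowest but is the shared head
mask = select_heads(scores, k=2)
print(mask)  # [True, True, True, False]
```

The shared head is selected regardless of its score, so every token always has at least one attention path even if the router's choices are poor.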

----

Link - arxiv.org/abs/2606.20945

Title: "Grouped Query Experts: Mixture-of-Experts on GQA Self-Attention"


