Grouped Query Experts: A Mixture-of-Experts Model on GQA Self-Attention
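Source: arxiv.org
Grouped Query Experts (GQE) adds a mixture-of-experts layer inside each group of grouped-query attention (GQA): a router selects k query-head experts to activate for each token, while all key-value heads stay dense and unchanged. At the 250M-parameter scale with a 30B-token budget, GQE matches the all-active GQA baseline in downstream accuracy while activating only half of the query heads per token, which reduces attention compute.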
Self-attention is central to Transformer performance and is often the most expensive part of the Transformer at long context lengths because its pairwise token interactions scale quadratically with sequence length. Standard dense attention also applies the same set of attention heads to every token regardless of token difficulty or information content. This uniform activation can waste compute, especially as sequences grow longer and attention cost increases rapidly. We propose Grouped Query Experts (GQE), a mixture-of-experts layer on top of grouped-query attention (GQA). Within each GQA group, a router selects k query-head experts per token while all key-value (KV) heads remain dense and unchanged. Thus, GQE keeps the KV cache benefits of GQA and reduces only the active query-head computation. On a fixed 30B token budget at the 250M parameter scale, GQE matches the all-active GQA baseline in downstream accuracy while activating half the query heads per token.
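The routing idea described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the router's form, so the linear router `Wr`, the softmax gating over the selected heads, and all function and parameter names here are hypothetical. For clarity the sketch computes every query head densely and then zeroes the inactive ones; an efficient implementation would skip the inactive heads' computation entirely.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries set to -inf receive weight 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gqe_attention(x, Wq, Wk, Wv, Wr, n_groups, heads_per_group, k):
    """Toy GQE layer (assumed design). x: (T, d_model).

    Returns out of shape (T, G, H, d_head) and a boolean mask of shape
    (T, G, H) marking which query heads are active for each token.
    """
    T = x.shape[0]
    G, H = n_groups, heads_per_group
    d_head = Wk.shape[1] // G

    q = (x @ Wq).reshape(T, G, H, d_head)   # H query heads per group
    keys = (x @ Wk).reshape(T, G, d_head)   # one dense KV head per group
    vals = (x @ Wv).reshape(T, G, d_head)

    # Router (assumption): one logit per query head, top-k within each group.
    logits = (x @ Wr).reshape(T, G, H)
    top = np.argsort(-logits, axis=-1)[..., :k]
    mask = np.zeros((T, G, H), dtype=bool)
    np.put_along_axis(mask, top, True, axis=-1)
    # Gates (assumption): softmax over the selected heads only.
    gates = softmax(np.where(mask, logits, -np.inf), axis=-1)

    # Causal attention; every query head in a group shares that group's KV head.
    scores = np.einsum("tghd,sgd->ghts", q, keys) / np.sqrt(d_head)
    causal = np.tril(np.ones((T, T), dtype=bool))
    attn = softmax(np.where(causal, scores, -np.inf), axis=-1)
    out = np.einsum("ghts,sgd->tghd", attn, vals)

    # Inactive heads get gate 0, so they contribute nothing to the output.
    return out * gates[..., None], mask

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d_model, G, H, d_head, k = 6, 16, 2, 4, 8, 2  # k = H/2: half the heads active
    x = rng.normal(size=(T, d_model))
    Wq = rng.normal(size=(d_model, G * H * d_head))
    Wk = rng.normal(size=(d_model, G * d_head))
    Wv = rng.normal(size=(d_model, G * d_head))
    Wr = rng.normal(size=(d_model, G * H))
    out, mask = gqe_attention(x, Wq, Wk, Wv, Wr, G, H, k)
    print(out.shape, mask.sum(axis=-1))
```

Note that the key and value projections are untouched by the router, which is why the KV cache has the same size as in plain GQA; only the number of query heads evaluated per token changes.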