# HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-18 08:00
- AIHOT score: 48
- AIHOT link: https://aihot.virxact.com/items/cmqq2kjqi060lslp5s3erlgna
- Original link: https://arxiv.org/abs/2606.20097

## AI Summary

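HydraHead hybridizes Full Attention (FA) and Linear Attention (LA) along the head axis. An interpretability-driven strategy keeps FA only for retrieval-critical heads, and a scale-normalized fusion module bridges the gap between the FA and LA output distributions. Using a three-stage transfer pipeline (parameter reuse and knowledge distillation) and only 15B training tokens, HydraHead improves on the baseline by more than 69% at 512K context length, matches the long-context performance of a 3:1 layer-wise hybrid at a 7:1 LA-to-FA ratio, and approaches Qwen3.5, a model of comparable size with a native 256K context length.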

## Main Text

The quadratic complexity of attention poses a critical bottleneck for long-context processing, spurring interest in hybrid attention designs. Most open-source hybrid models adopt a layer-wise strategy. Yet, prior work has noted the inherent difficulty of integrating Linear Attention (LA) with Full Attention (FA), suggesting that the design space of attention hybridization remains underexplored. To probe this space, we conduct interpretability analysis and observe that layers exhibit block-wise functional similarity, while individual heads within the same layer display distinct functional specialization despite sharing input features. This head-level heterogeneity suggests that the head dimension provides a natural and principled granularity for fusing heterogeneous attention signals. Building on this insight, we introduce HydraHead, a novel architecture that hybridizes FA and LA along the head axis. HydraHead features two key innovations: (1) an interpretability-driven selection strategy that identifies retrieval-critical heads and preserves FA only for them, and (2) a scale-normalized fusion module that reconciles the distributional gap between FA and LA head outputs. By leveraging a three-stage transfer pipeline with parameter reuse and distillation, we achieve high-performance hybrid models with minimal training overhead. Under a unified training setup, HydraHead outperforms other hybrid designs in long-context tasks while maintaining strong general reasoning. With interpretability-driven head selection, it matches a 3:1 layer-wise hybrid's long-context performance at a 7:1 LA-to-FA ratio. Crucially, trained on only 15B tokens, HydraHead achieves over 69% improvement over the baseline at 512K context length, approaching Qwen3.5, a leading model of comparable size with a native context length of 256K. This highlights the significant scaling potential of head-level hybridization.
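To make the head-axis idea concrete, here is a minimal NumPy sketch of one hybrid attention layer. It is an illustration under stated assumptions, not the paper's implementation. The abstract does not specify the linear-attention kernel or the fusion module, so the sketch assumes an `elu(x) + 1` feature map for LA and uses per-head RMS normalization as a stand-in for "scale-normalized fusion". The function names and the hand-picked `fa_heads` set are hypothetical. In the paper the FA heads are chosen by interpretability analysis.

```python
import numpy as np


def softmax_causal_attention(q, k, v):
    """Full (softmax) causal attention for one head; q, k, v have shape (T, d)."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def linear_causal_attention(q, k, v):
    """Causal linear attention for one head.

    Assumption: the elu(x) + 1 feature map; the source does not name a kernel.
    """
    def phi(x):
        return np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1, always > 0

    qf, kf = phi(q), phi(k)
    # Running sums over the prefix replace the T x T attention matrix.
    kv = np.cumsum(kf[:, :, None] * v[:, None, :], axis=0)  # (T, d, d_v)
    z = np.cumsum(kf, axis=0)                               # (T, d)
    num = np.einsum("td,tde->te", qf, kv)
    den = np.einsum("td,td->t", qf, z)[:, None]
    return num / den


def rms_normalize(x, eps=1e-6):
    """Rescale each token's head output to unit RMS (stand-in for the fusion module)."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)


def hybrid_heads(q, k, v, fa_heads):
    """Head-wise hybrid: heads in `fa_heads` use FA, all others use LA.

    q, k, v have shape (H, T, d); returns the concatenated heads, shape (T, H * d).
    """
    outputs = []
    for h in range(q.shape[0]):
        attend = softmax_causal_attention if h in fa_heads else linear_causal_attention
        outputs.append(rms_normalize(attend(q[h], k[h], v[h])))
    return np.concatenate(outputs, axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((8, 16, 4)) for _ in range(3))
    # One FA head out of eight corresponds to a 7:1 LA-to-FA ratio.
    out = hybrid_heads(q, k, v, fa_heads={0})
    print(out.shape)  # (16, 32)
```

Normalizing each head before concatenation puts FA and LA outputs on a common scale, which is one simple way to approximate the distribution-matching role the abstract assigns to the fusion module. A trained model would likely use learned scales and an output projection on top.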