Taylor-Calibrate: A Principled Initialization Method for Hybrid Linear Attention Distillation

2026-06-15 08:00

AI Summary

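Hybrid linear attention models can speed up long-context inference, but when a pretrained Transformer is converted into a Gated DeltaNet student, directly copying the teacher's attention projections gives a brittle initialization that takes many distillation tokens to repair. Taylor-Calibrate is a lightweight initialization method that uses Taylor-guided teacher attention statistics to set the value projection, memory timescale, write gates, and output gate, then applies a short per-layer alignment step to match the teacher output. Across four teacher settings and three retained-layer policies, Taylor-Calibrate gives substantially stronger zero-shot students, with up to an 88x improvement in a representative ablation, and reaches matched recovery targets with 4.9x to 9.2x fewer training tokens than naive conversion.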

Original abstract

Hybrid linear attention models offer an appealing path to faster long-context inference: they reduce the quadratic cost and KV-cache burden of full softmax attention while retaining much of the quality of Transformer models. A practical way to obtain such models is to convert a pretrained Transformer instead of pretraining a new architecture from scratch, but this conversion is still brittle. Simply copying the teacher attention projections into a Gated DeltaNet (GDN) student does not specify the new recurrent decay, write, and output-gating dynamics. As a result, the converted model often starts in a poor dynamical regime and must spend many distillation tokens repairing initialization rather than learning the remaining teacher behavior. We propose Taylor-Calibrate, a lightweight initialization method for hybrid GDN students. The method uses Taylor-guided teacher attention statistics to set the value projection, memory timescale, write gates, and output gate, then applies a short per-layer alignment step to match each converted layer to the teacher output. Across four teacher settings and three retained-layer policies, Taylor-Calibrate gives substantially stronger zero-shot students, with up to an 88x improvement in a representative ablation, and reaches matched recovery targets with 4.9x to 9.2x fewer training tokens than naive conversion.
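To make the "recurrent decay" and "write" dynamics concrete, here is a minimal NumPy sketch of one step of a gated delta-rule recurrence in the form commonly given for Gated DeltaNet. It is an illustration under assumptions, not the paper's code: the function name `gated_delta_step`, the scalar gates `alpha` (decay) and `beta` (write strength), and the unit-norm key are all choices made for this example, and the output gate is omitted. Copying the teacher's query, key, and value projections fixes `q`, `k`, and `v` but leaves `alpha` and `beta` unspecified, and those are the kind of quantities an initialization method has to set.

```python
import numpy as np

def gated_delta_step(S, q, k, v, alpha, beta):
    """One step of a gated delta-rule recurrence (illustrative sketch).

    S     : (d_v, d_k) memory state
    q, k  : (d_k,) query and key
    v     : (d_v,) value
    alpha : scalar in (0, 1], decay applied to the old memory
    beta  : scalar in [0, 1], strength of the write
    """
    k = k / (np.linalg.norm(k) + 1e-8)    # assume unit-norm keys
    S = alpha * S                         # decay the old memory
    pred = S @ k                          # what the memory returns for this key
    S = S + beta * np.outer(v - pred, k)  # delta-rule correction toward v
    return S, S @ q                       # new state and read-out for q

# Usage: write one (key, value) pair, then let it decay for one step.
S = np.zeros((3, 2))
k0 = np.array([1.0, 0.0])
k1 = np.array([0.0, 1.0])
v0 = np.array([1.0, 2.0, 3.0])

S, out_write = gated_delta_step(S, k0, k0, v0, alpha=1.0, beta=1.0)
# beta=0.0 means nothing is written, so this step only applies the decay.
S, out_decay = gated_delta_step(S, k0, k1, np.zeros(3), alpha=0.5, beta=0.0)
```

In this toy run, `out_write` recovers `v0` because a full-strength write with no decay stores the pair exactly, while `out_decay` is half of `v0` because one step with `alpha=0.5` shrinks the stored memory. A decay close to 0 forgets almost immediately and a write strength close to 0 never stores anything, which is one way to read the abstract's "poor dynamical regime".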

HuggingFace Daily Papers (community trending papers)

Read the original: arxiv.org
