Taylor-Calibrate:混合线性注意力蒸馏的原则性初始化方法
阅读原文· arxiv.org混合线性注意力模型可加速长上下文推理,但将预训练Transformer转换为Gated DeltaNet学生模型时,直接复制教师注意力投影会导致初始化脆弱,需大量蒸馏token修复。Taylor-Calibrate是一种轻量级初始化方法,利用Taylor引导的教师注意力统计设定值投影、记忆时间尺度、写门和输出门,再通过短逐层对齐匹配教师输出。在四个教师设置和三种保留层策略下,Taylor-Calibrate显著提升零样本学生性能,代表性消融改进高达88倍,达到匹配恢复目标所需训练token比朴素转换少4.9至9.2倍。
Hybrid linear attention models offer an appealing path to faster long-context inference: they reduce the quadratic cost and KV-cache burden of full softmax attention while retaining much of the quality of Transformer models. A practical way to obtain such models is to convert a pretrained Transformer instead of pretraining a new architecture from scratch, but this conversion is still brittle. Simply copying the teacher attention projections into a Gated DeltaNet (GDN) student does not specify the new recurrent decay, write, and output-gating dynamics. As a result, the converted model often starts in a poor dynamical regime and must spend many distillation tokens repairing initialization rather than learning the remaining teacher behavior. We propose Taylor-Calibrate, a lightweight initialization method for hybrid GDN students. The method uses Taylor-guided teacher attention statistics to set the value projection, memory timescale, write gates, and output gate, then applies a short per-layer alignment step to match each converted layer to the teacher output. Across four teacher settings and three retained-layer policies, Taylor-Calibrate gives substantially stronger zero-shot students, with up to an 88x improvement in a representative ablation, and reaches matched recovery targets with 4.9x--9.2x fewer training tokens than naive conversion.