Linearizing Vision Transformers via Test-Time Training
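This work proposes a method for linearizing pretrained Transformers such as Stable Diffusion 3.5. It aligns both architecture and representation: it identifies the structural similarity between the TTT architecture and Softmax attention, and introduces modules such as key instance normalization to align representational properties. With only 1 hour of fine-tuning on 4×H20 GPUs, the resulting SD3.5-T^5 model matches the text-to-image quality of the fine-tuned Softmax model while accelerating inference by 1.32× and 1.47× at 1K and 2K resolutions, respectively. The code is open source.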
While linear-complexity attention mechanisms offer a promising alternative to Softmax attention for overcoming the quadratic bottleneck, training such models from scratch remains prohibitively expensive. Inheriting weights from pretrained Transformers provides an appealing shortcut, yet the fundamental representational gap between Softmax and linear attention prevents effective weight transfer. In this work, we address this conversion challenge from two perspectives: architectural alignment and representational alignment. We identify Test-Time Training (TTT) as a linear-complexity architecture whose two-layer dynamic formulation is structurally aligned with Softmax attention, enabling direct inheritance of pretrained attention weights. To further align representational properties, including key shift-invariance and locality, we introduce key instance normalization and a lightweight locality enhancement module. We validate our approach by linearizing Stable Diffusion 3.5 and introduce SD3.5-T^5 (Transformer To Test Time Training). With only 1 hour of fine-tuning on 4×H20 GPUs, SD3.5-T^5 achieves comparable text-to-image quality to the fine-tuned Softmax model, while accelerating inference by 1.32× and 1.47× at 1K and 2K resolutions. Code is available at https://github.com/LeapLabTHU/Transformer-to-TTT.
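The key shift-invariance property mentioned above can be illustrated with a small numerical sketch. Softmax attention is unchanged when the same offset is added to every key, because the offset adds a constant to every score and the softmax normalization cancels it. A plain linear attention has no such cancellation. The code below is a minimal illustration under stated assumptions, not the authors' implementation: `linear_attention` is a generic unnormalized linear attention standing in for the TTT layer, and `key_instance_norm` assumes that key instance normalization standardizes each key channel across tokens. The abstract does not specify the exact form of either, and all function names here are hypothetical.

```python
import math

def softmax_attention(q, keys, values):
    """Single-query softmax attention over lists of key/value vectors."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim_v = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim_v)]

def linear_attention(q, keys, values):
    """Generic unnormalized linear attention: q^T (sum_i k_i v_i^T).

    Assumption: a stand-in for a linear-complexity layer, not the paper's TTT.
    The key-value state is built once, so cost grows linearly with token count.
    """
    dim_k, dim_v = len(keys[0]), len(values[0])
    state = [[sum(k[a] * v[b] for k, v in zip(keys, values))
              for b in range(dim_v)] for a in range(dim_k)]
    return [sum(q[a] * state[a][b] for a in range(dim_k))
            for b in range(dim_v)]

def key_instance_norm(keys, eps=1e-6):
    """Standardize each key channel across tokens (assumed form of the module)."""
    n, dim = len(keys), len(keys[0])
    out = [[0.0] * dim for _ in range(n)]
    for d in range(dim):
        column = [k[d] for k in keys]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n
        for i in range(n):
            out[i][d] = (column[i] - mean) / math.sqrt(var + eps)
    return out

def shift_keys(keys, offset):
    """Add the same offset vector to every key."""
    return [[x + o for x, o in zip(k, offset)] for k in keys]

def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

# Toy data: one query, three tokens, two channels.
q = [0.5, -1.0]
keys = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
shifted = shift_keys(keys, [10.0, -4.0])

# How much each output changes when all keys are shifted by the same offset.
softmax_gap = max_abs_diff(softmax_attention(q, keys, values),
                           softmax_attention(q, shifted, values))
linear_gap = max_abs_diff(linear_attention(q, keys, values),
                          linear_attention(q, shifted, values))
normed_gap = max_abs_diff(
    linear_attention(q, key_instance_norm(keys), values),
    linear_attention(q, key_instance_norm(shifted), values))

print("softmax gap:", softmax_gap)
print("linear gap:", linear_gap)
print("linear gap with key normalization:", normed_gap)
```

In this toy setting the softmax output is essentially unaffected by the shift, the plain linear output changes substantially, and normalizing the keys across tokens removes the dependence on the offset. This is one plausible reading of why a normalization on keys helps a linear layer inherit weights trained under Softmax attention.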