Scaling Rules for Large-Scale Feature Learning in Gated Delta Networks

2026-06-02 08:00 · 31 days ago

AI Summary

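μP has enabled zero-shot hyperparameter transfer for standard Transformers, but its extension to linear models, particularly Gated Delta Networks with structured state transitions, remains unexplored. By propagating coordinate-size estimates through the forward pass, gating mechanisms, and recurrent dynamics, the authors derive scaling rules for Gated Delta Networks. Language-model pre-training experiments confirm that the resulting configuration gives stable learning-rate transfer across model widths under both AdamW and SGD, whereas standard parametrization fails to transfer.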

Original abstract · not translated

Training and scaling Large Language Models demand enormous computational resources, motivating both efficient sub-quadratic architectures and principled hyperparameter tuning methods. While the Maximal Update Parametrization (μP) has enabled zero-shot hyperparameter transfer for standard Transformers, its extension to linear models, particularly those with structured state transitions and complicated architectures, remains largely unexplored. By rigorously propagating coordinate-size estimates through the forward pass, gating mechanisms, and recurrent state dynamics, we derive the scaling rules for Gated Delta Network. Experiments on language-model pre-training confirm that our configurations enable stable learning-rate transfer across model widths under both AdamW and SGD, whereas standard parametrization fails to transfer, validating the correctness and practical utility of our analysis.

HuggingFace Daily Papers (community trending papers)


Read the original · arxiv.org
