# Scaling Rules for Large-Scale Feature Learning in Gated Delta Networks

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-02 08:00
- AIHOT score: 69
- AIHOT link: https://aihot.virxact.com/items/cmpz0edm20524sli3yudqenhp
- Original link: https://arxiv.org/abs/2606.04048

## AI Summary

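μP already enables zero-shot hyperparameter transfer for standard Transformers, but its extension to linear models, especially Gated Delta Networks with structured state transitions, has not yet been explored. By propagating coordinate-size estimates through the forward pass, the gating mechanisms, and the recurrent dynamics, the authors derive scaling rules for Gated Delta Networks. Language-model pre-training experiments confirm that this configuration yields stable learning-rate transfer across model widths under both AdamW and SGD, whereas standard parametrization fails to transfer.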

## Main Text

Training and scaling Large Language Models demand enormous computational resources, motivating both efficient sub-quadratic architectures and principled hyperparameter tuning methods. While the Maximal Update Parametrization (μP) has enabled zero-shot hyperparameter transfer for standard Transformers, its extension to linear models, particularly those with structured state transitions and complicated architectures, remains largely unexplored. By rigorously propagating coordinate-size estimates through the forward pass, gating mechanisms, and recurrent state dynamics, we derive the scaling rules for the Gated Delta Network. Experiments on language-model pre-training confirm that our configurations enable stable learning-rate transfer across model widths under both AdamW and SGD, whereas standard parametrization fails to transfer, validating the correctness and practical utility of our analysis.
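To make "learning-rate transfer across model widths" concrete, the sketch below shows the commonly cited μP rule for hidden weight matrices under Adam-style optimizers, where the learning rate shrinks in proportion to 1/width relative to a tuned base width. This is the generic rule known from standard Transformers, not the rules derived in this paper: the abstract does not state the paper's scaling for the gates or the recurrent state, so none is reproduced here. The helper names `coordinate_size` and `mup_hidden_lr`, and the example values, are hypothetical.

```python
import math


def coordinate_size(vec):
    """Root-mean-square entry of a vector: the typical size of one coordinate."""
    return math.sqrt(sum(x * x for x in vec) / len(vec))


def mup_hidden_lr(base_lr, base_width, width):
    """Hypothetical helper: width-scaled learning rate for hidden matrices.

    Assumes the generic muP rule for Adam-style optimizers (lr proportional
    to 1/width). It is NOT the Gated Delta Network rule from the paper.
    """
    if base_width <= 0 or width <= 0:
        raise ValueError("widths must be positive")
    return base_lr * base_width / width


# Tune once at a small base width, then reuse the same base_lr at larger widths.
base_lr, base_width = 1e-3, 256  # illustrative values, not from the paper
for width in (256, 512, 1024, 2048):
    print(width, mup_hidden_lr(base_lr, base_width, width))
```

Under standard parametrization the same learning rate would be applied at every width, which is the setting the abstract reports as failing to transfer.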
