Design Principles for Symmetry-Compatible Optimizers
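Source: arxiv.org

In deep learning, neural network architectures have symmetries, whereas mainstream optimizers update each coordinate independently, so the two are mismatched. This work proposes a symmetry-compatible principle: an optimizer's gradient update rule should be equivariant under the action of the symmetry group of the corresponding parameter block. On this basis, it gives a unified view of general matrix layers and derives specialized optimizers for parameter blocks with different symmetries, including embedding layers, the LM head, SwiGLU MLP projection matrices, and MoE routers, forming an end-to-end layerwise optimizer stack. Experiments show that, in pre-training of dense and sparse MoE models, symmetry-compatible updates consistently improve validation loss over AdamW and, in several cases, improve training stability.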
A striking geometric disparity has long persisted in the practice of deep learning. While modern neural network architectures naturally exhibit rich symmetry and equivariance properties, popular optimizers such as Adam and its variants operate inherently coordinate-wise, rendering them unable to respect the equivariance structures of the parameter space. We address this disparity by introducing a symmetry-compatible principle for optimizer design: the gradient update rule should be equivariant under the symmetry group acting on the corresponding weight block. Following this principle, we first provide a unified perspective on bi-orthogonally equivariant updates for general matrix layers, as employed by stochastic spectral descent, Muon, Scion, and polar gradient methods. More importantly, by moving from orthogonal groups to permutation and shared-shift symmetries, we derive symmetry-compatible optimizers for parameter blocks whose symmetries differ from those of general matrix layers: embedding and LM head matrices, SwiGLU MLP projections, and MoE router matrices. These constructions include one-sided spectral, row-norm, hybrid row-norm/spectral, row-aware, column-aware, centered row-norm, and left-spectral updates. They yield an end-to-end layerwise optimizer stack in which each major matrix-valued parameter class is assigned an update whose equivariance matches its symmetry group. We corroborate this principle through pre-training experiments on dense and sparse MoE language models, including Qwen3-0.6B-style, Gemma 3 1B-style, OLMoE-1B-7B-style, and downsized gpt-oss architectures. Across these experiments, symmetry-compatible updates consistently improve final validation loss, and in several cases training stability, over corresponding AdamW updates.
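The equivariance requirement in the abstract can be checked numerically. The sketch below is illustrative only and is not the paper's implementation. It assumes NumPy, and all function names are hypothetical. It compares three update maps on a random gradient matrix G: the polar factor U V^T (the spectral direction associated with Muon-style methods), a coordinate-wise sign used here as a stand-in for an Adam-like per-coordinate rule, and a row-normalized map taken as one plausible reading of a "row-norm" update. An update map f is equivariant under a pair (P, Q) when f(P G Q^T) = P f(G) Q^T.

```python
import numpy as np


def polar_update(grad):
    """Orthogonal polar factor U V^T of grad (a spectral update direction)."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt


def sign_update(grad):
    """Coordinate-wise sign: a stand-in for an Adam-like per-coordinate rule."""
    return np.sign(grad)


def row_norm_update(grad):
    """Scale each row to unit Euclidean norm (assumed form of a row-norm update)."""
    return grad / np.linalg.norm(grad, axis=1, keepdims=True)


def random_orthogonal(n, rng):
    """Random n x n orthogonal matrix from the QR factorization of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q


def equivariance_gap(update, grad, left, right):
    """Frobenius norm of update(L G R^T) - L update(G) R^T; zero means equivariant."""
    transformed_then_updated = update(left @ grad @ right.T)
    updated_then_transformed = left @ update(grad) @ right.T
    return np.linalg.norm(transformed_then_updated - updated_then_transformed)


rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))        # toy gradient for a 4 x 3 weight block
P = random_orthogonal(4, rng)          # left orthogonal transform
Q = random_orthogonal(3, rng)          # right orthogonal transform
Pi = np.eye(4)[rng.permutation(4)]     # row permutation matrix

# Bi-orthogonal symmetry: the polar factor commutes with (P, Q), the sign map does not.
polar_gap = equivariance_gap(polar_update, G, P, Q)
sign_gap = equivariance_gap(sign_update, G, P, Q)

# Row permutation on the left, orthogonal transform on the right.
row_perm_gap = equivariance_gap(row_norm_update, G, Pi, Q)

print(f"polar gap under (P, Q):     {polar_gap:.2e}")
print(f"sign gap under (P, Q):      {sign_gap:.2e}")
print(f"row-norm gap under (Pi, Q): {row_perm_gap:.2e}")
```

The polar and row-norm gaps should sit at floating-point noise, while the sign gap is of order one. This is the mismatch the abstract describes: a coordinate-wise rule depends on the basis in which the weights are written, whereas the spectral and row-norm maps commute with the corresponding group actions.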