# A Curvature Perspective on Why Muon Outperforms Adam

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-03 08:00
- AIHOT score: 55
- AIHOT link: https://aihot.virxact.com/items/cmq687vre062msl5induieyuh
- Original link: https://arxiv.org/abs/2606.04662

## AI Summary

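The study explains, from a curvature perspective, why Muon is roughly twice as efficient as Adam in LLM training. A second-order Taylor expansion shows that the two optimizers have comparable first-order gains, but Muon incurs a smaller second-order curvature penalty. The curvature penalty decomposes into the update norm and Normalized Directional Sharpness (NDS): the update norms of the two optimizers are similar, while Muon's NDS is lower, and data imbalance amplifies this advantage. In the middle and late stages of training, Muon's NDS advantage comes mainly from smaller within-layer curvature. On the theory side, the paper proves that Muon achieves a smaller average NDS by balancing update energy across curvature groups, and that when curvature heterogeneity is strong enough, the local quadratic loss after the same number of steps is also lower.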

## Main Text

Muon improves training efficiency over Adam in large language-model training by about two times, but the local geometric source of this advantage remains unclear. Our work takes a first step toward demystifying Muon's superiority over Adam from a curvature perspective. First, we apply a second-order Taylor approximation to the training landscape and show that Muon achieves a larger one-step loss decrease than Adam at matched validation loss. The two optimizers have comparable first-order gains, but Muon consistently incurs a smaller second-order curvature penalty. Second, we decompose this curvature penalty into the squared update norm and Normalized Directional Sharpness (NDS). We find that Muon and Adam have comparable update norms, so Muon's smaller curvature penalty is driven by lower NDS, not update scale. Third, we study how training data and model structure shape Muon's NDS advantage. Using Zipf-Probabilistic Context-Free Grammar (PCFG) data with controlled imbalance, we show that data imbalance amplifies Muon's NDS advantage over Adam. A within-/cross-layer decomposition further shows that, in the middle and late stages of training, Muon's lower NDS is mainly sustained by smaller within-layer curvature. Beyond empirical evidence, we analyze stylized quadratic problems with heterogeneous curvature and gradient alignment toward high-curvature modes. We prove that Muon attains a smaller average NDS than GD by balancing update energy across curvature groups; when curvature heterogeneity is sufficiently strong, this also yields lower local quadratic loss after the same number of steps.
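The decomposition described above can be illustrated on a toy quadratic. The sketch below is not the paper's code. It assumes the one-step change is approximated as `g^T u + 0.5 * u^T H u` and that NDS is the Rayleigh quotient `u^T H u / ||u||^2`, so that the curvature penalty equals `0.5 * ||u||^2 * NDS`; the abstract does not state the exact formula. The function name `taylor_terms`, the diagonal Hessian, and the "balanced" update (equal magnitude in every coordinate) are hypothetical stand-ins chosen for illustration. In particular, the balanced update is not Muon's matrix orthogonalization; it only mimics the idea of spreading update energy evenly across curvature groups.

```python
import numpy as np

def taylor_terms(grad, hessian, update):
    """Second-order Taylor terms for one update step u.

    Assumed definitions (not taken from the paper's text):
      first-order term   = g^T u
      NDS                = u^T H u / ||u||^2
      curvature penalty  = 0.5 * ||u||^2 * NDS
    """
    first_order = float(grad @ update)
    sq_norm = float(update @ update)
    nds = float(update @ hessian @ update) / sq_norm
    penalty = 0.5 * sq_norm * nds
    return first_order, penalty, sq_norm, nds

# Toy quadratic with heterogeneous curvature: one sharp mode, three flat ones.
H = np.diag([100.0, 1.0, 1.0, 1.0])
w = np.ones(4)
g = H @ w  # gradient of 0.5 * w^T H w, dominated by the sharp mode

# Update along the normalized negative gradient (unit norm).
u_gd = -g / np.linalg.norm(g)

# Hypothetical "balanced" update: same unit norm, equal energy per coordinate.
u_bal = -np.sign(g) / np.sqrt(g.size)

fo_gd, pen_gd, sq_gd, nds_gd = taylor_terms(g, H, u_gd)
fo_bal, pen_bal, sq_bal, nds_bal = taylor_terms(g, H, u_bal)

print(f"gradient direction: sq_norm={sq_gd:.3f}  NDS={nds_gd:.2f}  penalty={pen_gd:.2f}")
print(f"balanced direction: sq_norm={sq_bal:.3f}  NDS={nds_bal:.2f}  penalty={pen_bal:.2f}")
```

With matched update norms, any difference in the curvature penalty comes entirely from NDS, which is the comparison the abstract draws between Muon and Adam. In this toy setup the gradient-aligned step concentrates its energy on the sharp mode, while the balanced step does not, so its NDS is lower.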
