# Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-20 08:00
- AIHOT score: 61
- AIHOT link: https://aihot.virxact.com/items/cmpgadnmr0dqmsljw55hcqd9s
- Original link: https://arxiv.org/abs/2605.21803

## AI Summary

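The study finds that the optimizer is a key dimension shaping a model's representational capacity, challenging the conventional view of it as a fixed training detail. By analyzing the eigenspectra of feed-forward network representations, it shows that the same Transformer architecture follows markedly different spectral scaling laws under different optimizers. In a fixed setup, AdamW shows only weak spectral-rank scaling on hard-to-learn rare-token representations, whereas Muon achieves near-linear scaling, a 2.3× increase in the scaling exponent. Importantly, this difference cannot be explained by validation loss alone: even when losses match, the representation structure can differ sharply. The study shows that optimizer effects often exceed those of architectural interventions, and it advocates co-designing the optimizer and the architecture.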

## Main Text

Scaling laws have made language-model performance predictable from model size, data, and compute, but they typically treat the optimizer as a fixed training detail. We show that this assumption misses a fundamental axis of representation scaling: how effectively the optimizer converts added FFN width into utilized spectral capacity. Using eigenspectra of feed-forward network representations, measured through soft and hard spectral ranks, we find that the same Transformer architecture realizes markedly different spectral scaling laws when trained with different optimizers. Holding architecture and width schedule fixed, AdamW exhibits weak hard-rank scaling (β=0.44) on rare-token (TAIL) representations, where learning is known to be hardest, whereas Muon achieves near-linear scaling (β=1.02) in the same regime, a 2.3× increase in the scaling exponent. This difference is not reducible to validation loss: under extended training, AdamW configurations can match low-rank Dion variants in perplexity while exhibiting sharply different spectral geometry, demonstrating that matched loss does not imply matched representation structure. Hard-soft rank asymmetry further reveals that optimizers differ not only in how much capacity is realized, but also in how that capacity is structured across eigenmodes. To disentangle optimizer effects from architectural ones, we compare against architectural interventions (e.g., attention rank and positional encoding) and find that optimizer-induced spectral shifts often exceed the architectural effects. These results suggest optimization as a first-class axis of representation scaling, motivating optimizer-architecture co-design.
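The abstract does not spell out how the hard and soft spectral ranks are computed or how the exponent β is fitted, so the following is only a minimal sketch under stated assumptions. It assumes that the hard rank is the participation ratio of the covariance eigenvalues, that the soft rank is the exponential of their Shannon entropy, and that β is the slope of a log-log least-squares fit of rank against FFN width. The function names `spectral_ranks` and `fit_scaling_exponent` are hypothetical and are not taken from the paper.

```python
import numpy as np


def spectral_ranks(features):
    """Return (hard_rank, soft_rank) for an (n_samples, n_dims) activation matrix.

    Assumed definitions (not confirmed by the abstract):
      hard rank = participation ratio, (sum of eigenvalues)^2 / sum of squared eigenvalues
      soft rank = exp(Shannon entropy of the normalized eigenvalues)
    """
    features = np.asarray(features, dtype=np.float64)
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / max(features.shape[0] - 1, 1)
    # eigvalsh handles symmetric matrices; clip tiny negative values from round-off.
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    total = eigvals.sum()
    if total == 0.0:
        return 0.0, 0.0
    p = eigvals / total  # normalized eigenspectrum, sums to 1
    hard_rank = 1.0 / np.sum(p ** 2)
    nonzero = p[p > 0]
    soft_rank = np.exp(-np.sum(nonzero * np.log(nonzero)))
    return float(hard_rank), float(soft_rank)


def fit_scaling_exponent(widths, ranks):
    """Fit rank ~ c * width**beta by least squares in log-log space; return beta."""
    log_w = np.log(np.asarray(widths, dtype=np.float64))
    log_r = np.log(np.asarray(ranks, dtype=np.float64))
    beta, _intercept = np.polyfit(log_w, log_r, 1)
    return float(beta)


if __name__ == "__main__":
    # Synthetic illustration only: these ranks are generated, not measured.
    widths = np.array([256, 512, 1024, 2048])
    weak = 3.0 * widths ** 0.44
    near_linear = 0.5 * widths ** 1.02
    print(fit_scaling_exponent(widths, weak))
    print(fit_scaling_exponent(widths, near_linear))
```

Under these assumed definitions, both ranks equal the dimension for a perfectly flat spectrum and fall to 1 when a single eigenmode carries all the variance. The reported exponents are consistent with the stated ratio, since 1.02 / 0.44 ≈ 2.3.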
