# Negligible in Size, Significant in Impact: On Scale Vectors in Large Language Models

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-26 08:00
- AIHOT score: 64
- AIHOT link: https://aihot.virxact.com/items/cmpnfrc8m0x72sl012k81qj2p
- Original link: https://arxiv.org/abs/2605.26895

## AI Summary

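This study systematically examines the role of learnable scale vectors in large language models. Although they account for only a tiny fraction of parameters, removing them substantially degrades pre-training. The study shows that, in Pre-Norm architectures, their main role is not to increase expressivity but to improve the optimization of subsequent linear mappings through a self-amplifying preconditioning effect. In addition, weight decay benefits Input-Norm layers but harms Output-Norm layers. Building on these findings, the paper proposes three lightweight improvements (branch-specific heterogeneity, improved placement around linear mappings, and magnitude-direction reparameterization) and combines them into a unified strategy. Experiments show that this strategy consistently achieves lower terminal loss and more favorable scaling behavior, with negligible additional parameter and computational overhead.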

## Main Text

Normalization layers in modern large language models (LLMs) consist of a deterministic normalization operation and a learnable scale vector. While the normalization operation has been extensively studied, the scale vector remains poorly understood despite its ubiquitous use. In this work, we present a systematic study of scale vectors in LLMs from the perspectives of expressivity, optimization, and architectural structure. First, we show empirically that although scale vectors constitute only a negligible fraction of model parameters, removing them substantially degrades LLM pre-training. Our theory further shows that, in Pre-Norm architectures, scale vectors do not increase expressivity; instead, they improve optimization through a self-amplifying preconditioning effect on subsequent linear mappings. Second, we investigate the role of weight decay for scale vectors. By distinguishing Input-Norm and Output-Norm layers, we theoretically show that weight decay is beneficial for the former but harmful for the latter, due to their distinct roles in optimization and expressivity. Third, motivated by this understanding, we propose three lightweight and complementary improvements to scale vectors: branch-specific heterogeneity, improved placement around linear mappings, and magnitude-direction reparameterization. Both theory and experiments show that each improvement yields consistent gains. Finally, we combine these improvements into a unified scale-vector strategy and evaluate it through extensive LLM pre-training experiments on dense and mixture-of-experts models ranging from 0.12B to 2B parameters, across multiple optimizers and learning rate schedules, under industrial-scale token budgets. The unified strategy consistently achieves lower terminal loss than well-tuned baselines and exhibits more favorable scaling behavior, while adding negligible parameter and computational overhead.
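The claim that scale vectors add no expressivity in Pre-Norm architectures can be illustrated with a small numerical check. The sketch below is not taken from the paper. It assumes an RMSNorm-style normalization (the abstract does not name the normalization type) followed directly by a linear mapping, and all function and variable names (`rms_normalize`, `pre_norm_linear`, `absorb_scale`, `W`, `gamma`) are hypothetical. It shows that scaling the normalized input elementwise by `gamma` and then applying `W` gives the same output as applying a single matrix whose column `j` is column `j` of `W` multiplied by `gamma[j]`.

```python
import math


def rms_normalize(x, eps=1e-6):
    """Deterministic part of an RMSNorm-style layer (assumed here):
    divide x by its root-mean-square."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def matvec(W, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def pre_norm_linear(W, gamma, x):
    """Scale-vector path: y = W (gamma * normalize(x))."""
    x_hat = rms_normalize(x)
    return matvec(W, [g * v for g, v in zip(gamma, x_hat)])


def absorb_scale(W, gamma):
    """Fold gamma into W by scaling column j of W by gamma[j]."""
    return [[w * g for w, g in zip(row, gamma)] for row in W]


# Hypothetical toy values, chosen only for illustration.
W = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
gamma = [2.0, 0.5, -1.0]
x = [3.0, -4.0, 12.0]

y_scaled = pre_norm_linear(W, gamma, x)
y_absorbed = matvec(absorb_scale(W, gamma), rms_normalize(x))

# The two paths agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(y_scaled, y_absorbed))
print(max_diff < 1e-9)
```

Because both parameterizations represent the same set of functions, any benefit from keeping `gamma` as a separate parameter has to come from how training proceeds, which is consistent with the abstract's framing of the scale vector as an optimization aid rather than a source of extra expressivity.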