SwiGLU is everywhere in modern LLMs, but for large inputs it behaves like x². That quadratic blow-up inflates activations, amplifies outliers, and makes deep networks and low-precision (FP8/FP4) training prone to loss spikes.
We propose PowLU, a drop-in activation built for stable large-scale pre-training. 🧵
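A quick way to see the quadratic growth of SwiGLU: it multiplies a SiLU-gated branch by a linear branch, and for large positive z, SiLU(z) ≈ z, so the product of the two branches scales with the square of the input magnitude. The sketch below uses scalar stand-ins for the two projections (the values `gate=1.5` and `up=2.0` and the helper `scaled_output` are illustrative assumptions, not from the thread) and checks that doubling the input scale roughly quadruples the output. It only illustrates SwiGLU and does not implement PowLU.

```python
import numpy as np

def silu(z):
    # SiLU / Swish: z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

def swiglu(gate, up):
    # SwiGLU on pre-projected inputs: SiLU(gate branch) * (linear branch)
    return silu(gate) * up

def scaled_output(scale, gate=1.5, up=2.0):
    # Hypothetical scalar stand-ins for the two projections of one input;
    # scaling the input by `scale` scales both branches linearly.
    return swiglu(scale * gate, scale * up)

# Doubling the input scale should roughly quadruple the output (x^2 behaviour).
ratios = [scaled_output(2 * s) / scaled_output(s) for s in (10.0, 100.0, 1000.0)]
print(ratios)
```

Each ratio comes out close to 4, which is what quadratic growth predicts when the gate pre-activation is large and positive.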