Activation Function Ablation Study
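Source: blog.eleuther.ai

An activation function ablation on GPT-like autoregressive language models, evaluating how different activation functions affect model performance. By comparing how the various activation functions perform in an autoregressive architecture, it examines their effect on model expressiveness, training stability, and generation quality, and offers experimental evidence for activation function selection and architecture optimization in large language models.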
This was an ablation of activation functions on GPT-like models of ~100M params that I ran ages ago. Each model was trained for 10k iters, which isn't very long. My original goal was to show that the activation function doesn't matter that much, but to do so I'd need a bunch more runs to estimate variance and show that the differences aren't statistically significant, and I don't plan on running a more exhaustive version of this experiment any time soon. So, I'm just dumping these results here in case anyone has any use for them. All the activation definitions are here.
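For readers who just want a feel for what is being compared, here is a minimal sketch of a few of the better-known activations from the table, written as scalar functions in plain Python. These are the standard textbook forms, not necessarily the exact definitions used in these runs (for example, I'm assuming the erf form of GELU rather than the tanh approximation), and the custom ones like spike2 or seagull are deliberately left out because their definitions aren't given in this post.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def softsign(x: float) -> float:
    return x / (1.0 + abs(x))


def sigmoid(x: float) -> float:
    # Branch on the sign so exp() never overflows for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def softplus(x: float) -> float:
    # log(1 + e^x), rearranged to stay stable for large |x|.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def elu(x: float, alpha: float = 1.0) -> float:
    return x if x > 0 else alpha * math.expm1(x)


def swish(x: float) -> float:
    # Swish with beta = 1 (also known as SiLU).
    return x * sigmoid(x)


def gelu(x: float) -> float:
    # Exact erf form; assumed here, the runs may have used an approximation.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def mish(x: float) -> float:
    return x * math.tanh(softplus(x))


def tanhshrink(x: float) -> float:
    return x - math.tanh(x)
```

In a real model these would be applied elementwise to the MLP hidden activations as tensor ops; the scalar versions are only meant to pin down the math.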
| Name | Pile Validation BPB | LAMBADA acc | LAMBADA ppl |
|---|---|---|---|
| softsign | 1.1485 | 34.3 | 81.32 |
| ReLU | 1.1482 | 34.3 | 82.01 |
| spike2 | 1.1480 | 34.4 | 83.13 |
| selu | 1.1485 | 34.5 | 83.32 |
| elish | 1.1492 | 33.9 | 84.04 |
| tanhexp | 1.1474 | 33.7 | 84.06 |
| sigmoid | 1.1484 | 33.9 | 85.20 |
| tanhshrink | 1.1483 | 33.9 | 85.42 |
| maxtanh | 1.1479 | 33.7 | 85.53 |
| roottanh | 1.1485 | 33.4 | 86.00 |
| softplusmone | 1.1488 | 34.1 | 86.21 |
| logsoftmax | 1.1492 | 34.2 | 86.29 |
| ELU | 1.1496 | 33.8 | 86.37 |
| Swish | 1.1482 | 33.7 | 86.42 |
| softmax | 1.1491 | 33.2 | 86.74 |
| square_relax | 1.1484 | 33.5 | 86.92 |
| lisht | 1.1500 | 33.8 | 87.17 |
| GELU | 1.1453 | 34.0 | 87.84 |
| abs | 1.1489 | 33.5 | 87.96 |
| tanh | 1.1481 | 33.2 | 89.28 |
| Mish | 1.1482 | 33.6 | 89.84 |
| triangle_relax | 1.1502 | 33.7 | 89.91 |
| seagull | 1.1487 | 33.3 | 90.08 |
| maxsig | 1.1480 | 33.3 | 90.23 |
| softplus | 1.1460 | 33.1 | 90.74 |
| minsin | 1.1498 | 33.3 | 91.18 |
| snake | 1.1484 | 33.1 | 91.93 |
| cosid | 1.1490 | 33.3 | 92.99 |
| spike | 1.1498 | 33.3 | 93.78 |
| bipolarsigmoid | 1.1513 | 32.8 | 96.73 |
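If you want to poke at these numbers, here is a rough sketch that summarizes how far apart the activations actually are. The rows are copied from the results above; the helper names (`RESULTS`, `best_by`) are just made up for this example. It only describes the spread across single runs, so it says nothing about statistical significance.

```python
from statistics import mean, pstdev

# (name, Pile validation BPB, LAMBADA acc, LAMBADA ppl), copied from the results above.
RESULTS = [
    ("softsign", 1.1485, 34.3, 81.32), ("ReLU", 1.1482, 34.3, 82.01),
    ("spike2", 1.1480, 34.4, 83.13), ("selu", 1.1485, 34.5, 83.32),
    ("elish", 1.1492, 33.9, 84.04), ("tanhexp", 1.1474, 33.7, 84.06),
    ("sigmoid", 1.1484, 33.9, 85.20), ("tanhshrink", 1.1483, 33.9, 85.42),
    ("maxtanh", 1.1479, 33.7, 85.53), ("roottanh", 1.1485, 33.4, 86.00),
    ("softplusmone", 1.1488, 34.1, 86.21), ("logsoftmax", 1.1492, 34.2, 86.29),
    ("ELU", 1.1496, 33.8, 86.37), ("Swish", 1.1482, 33.7, 86.42),
    ("softmax", 1.1491, 33.2, 86.74), ("square_relax", 1.1484, 33.5, 86.92),
    ("lisht", 1.1500, 33.8, 87.17), ("GELU", 1.1453, 34.0, 87.84),
    ("abs", 1.1489, 33.5, 87.96), ("tanh", 1.1481, 33.2, 89.28),
    ("Mish", 1.1482, 33.6, 89.84), ("triangle_relax", 1.1502, 33.7, 89.91),
    ("seagull", 1.1487, 33.3, 90.08), ("maxsig", 1.1480, 33.3, 90.23),
    ("softplus", 1.1460, 33.1, 90.74), ("minsin", 1.1498, 33.3, 91.18),
    ("snake", 1.1484, 33.1, 91.93), ("cosid", 1.1490, 33.3, 92.99),
    ("spike", 1.1498, 33.3, 93.78), ("bipolarsigmoid", 1.1513, 32.8, 96.73),
]


def best_by(rows, col, lower_is_better=True):
    """Return the row with the best value in column `col`."""
    pick = min if lower_is_better else max
    return pick(rows, key=lambda r: r[col])


bpb = [r[1] for r in RESULTS]
bpb_spread = max(bpb) - min(bpb)          # gap between worst and best BPB
bpb_rel_spread = bpb_spread / mean(bpb)   # same gap as a fraction of the mean
bpb_std = pstdev(bpb)                     # spread across activations, not across seeds

best_bpb = best_by(RESULTS, 1)
best_acc = best_by(RESULTS, 2, lower_is_better=False)
best_ppl = best_by(RESULTS, 3)

if __name__ == "__main__":
    print(f"BPB spread: {bpb_spread:.4f} ({bpb_rel_spread:.2%} of mean), std {bpb_std:.4f}")
    print(f"best BPB: {best_bpb[0]}, best LAMBADA acc: {best_acc[0]}, best LAMBADA ppl: {best_ppl[0]}")
```

On these numbers the gap between the best and worst Pile BPB is 0.0060, roughly 0.5% of the mean, and the best activation differs by metric (GELU on BPB, selu on LAMBADA acc, softsign on LAMBADA ppl). With one run per activation there's no way to tell how much of that is noise.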