针对Mixture-of-Experts模型的置信度自适应SwiGLU
阅读原文· arxiv.org本研究提出了置信度自适应SwiGLU,这是面向Mixture-of-Experts模型的一种SwiGLU变体。该方法根据token级的路由置信度动态调整专家门控的锐度,通过将SiLU门控的锐度系数参数化为路由器对数几率的可学习函数,使每个门控单元能在平滑的广泛激活与尖锐的选择性门控之间自适应插值。在FineWeb-Edu数据集上针对不同规模的MoE Transformer模型评估表明,κ-SwiGLU在引入极少量额外参数和微小计算开销的前提下,提升了模型的平均CORE性能。
SwiGLU has become a standard gated activation in modern Transformer MLPs, yet its gate sharpness -- the smoothness and selectivity of the gating function -- is typically fixed throughout training. In this work, we propose Confidence-Aware SwiGLU (κ-SwiGLU), a variant of SwiGLU for Mixture-of-Experts (MoE) models that adjusts expert gate sharpness according to token-level routing confidence. Specifically, κ-SwiGLU parameterizes the SiLU gate sharpness coefficient as a learnable function of the router logit, enabling each expert gate unit to interpolate between smooth, broadly active gating and sharp, selective gating. We evaluate κ-SwiGLU on the FineWeb-Edu dataset across MoE Transformer models ranging from 8 to 28 layers. Across these settings, κ-SwiGLU improves mean CORE performance while adding negligible parameters and incurring only a small computational overhead, demonstrating that confidence-aware gate sharpness is a promising mechanism for improving MoE MLPs. The code is available at https://github.com/askerlee/kappa-swiglu.