将混合专家模型剪枝蒸馏为密集语言模型

2026-05-27 08:00·37天前

AI 摘要

提出首个将已训练MoE模型转换为标准全密集架构的系统性框架：对专家进行评分、选择和分组，拼接为密集前馈网络并通过知识蒸馏精炼。在Qwen3-30B-A3B、DeepSeek-V2-Lite和GPT-OSS-20B上评估了7种评分、5种分组和2种幅度缩放方法共350种配置。新提出的多样性感知评分方法一致优于此前方法。在同等参数量下，MoE转密集相比密集到密集剪枝，经过约4B token蒸馏后平均下游准确率提升6.3个百分点，训练速度提升1.6倍。

原文 · 未翻译

Mixture-of-Experts (MoE) is now the dominant architecture for frontier language models, yet it requires all expert parameters to be loaded in memory, making it less preferable for memory-constrained deployment. Existing compression methods reduce the number of experts but the output remains an MoE model with the same fundamental limitation. We present the first systematic framework for converting a trained MoE into a standard fully dense architecture: experts are scored, selected, and grouped, then concatenated into a dense FFN and refined by knowledge distillation from the MoE teacher. We evaluate 7 scoring, 5 grouping, and 2 magnitude scaling methods across a range of selected expert counts on Qwen3-30B-A3B, yielding 350 configurations. We find that the choice of scoring method is the most impactful, with our novel diversity-aware scoring consistently outperforming prior methods on Qwen3-30B-A3B, DeepSeek-V2-Lite, and GPT-OSS-20B. Under a controlled comparison at matched parameter count, MoE-to-dense outperforms dense-to-dense pruning by +6.3 pp in average downstream accuracy after ~4B-token distillation at 1.6x faster training wall-clock speed.

HuggingFace Daily Papers（社区热门论文）

44导出 Markdown

将混合专家模型剪枝蒸馏为密集语言模型

2026-05-27 08:00·37天前

阅读原文· arxiv.org

AI 摘要

原文 · 保持原样，未翻译