# Pruning and Distilling Mixture-of-Experts Models into Dense Language Models

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-27 08:00
- AIHOT score: 44
- AIHOT link: https://aihot.virxact.com/items/cmq6pd9ed0ajusl5i51p481lp
- Original paper: https://arxiv.org/abs/2605.28207

## AI Summary

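The paper proposes the first systematic framework for converting a trained MoE model into a standard fully dense architecture: experts are scored, selected, and grouped, then concatenated into a dense feed-forward network (FFN) and refined via knowledge distillation. On Qwen3-30B-A3B, 7 scoring, 5 grouping, and 2 magnitude scaling methods are evaluated across a range of selected expert counts, for 350 configurations in total. The newly proposed diversity-aware scoring method consistently outperforms prior methods on Qwen3-30B-A3B, DeepSeek-V2-Lite, and GPT-OSS-20B. At matched parameter count, MoE-to-dense conversion improves average downstream accuracy by 6.3 percentage points over dense-to-dense pruning after about 4B tokens of distillation, with 1.6x faster wall-clock training.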

## Main Text

Mixture-of-Experts (MoE) is now the dominant architecture for frontier language models, yet it requires all expert parameters to be loaded in memory, making it less suitable for memory-constrained deployment. Existing compression methods reduce the number of experts, but the output remains an MoE model with the same fundamental limitation. We present the first systematic framework for converting a trained MoE into a standard fully dense architecture: experts are scored, selected, and grouped, then concatenated into a dense FFN and refined by knowledge distillation from the MoE teacher. We evaluate 7 scoring, 5 grouping, and 2 magnitude scaling methods across a range of selected expert counts on Qwen3-30B-A3B, yielding 350 configurations. We find that the choice of scoring method is the most impactful, with our novel diversity-aware scoring consistently outperforming prior methods on Qwen3-30B-A3B, DeepSeek-V2-Lite, and GPT-OSS-20B. Under a controlled comparison at matched parameter count, MoE-to-dense outperforms dense-to-dense pruning by +6.3 pp in average downstream accuracy after ~4B-token distillation, with 1.6x faster wall-clock training.
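
To make the "select, then concatenate into a dense FFN" step concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: it assumes each expert is a SwiGLU-style gated FFN, it ranks experts by made-up routing counts (a placeholder, not the paper's diversity-aware scoring), and it omits grouping, magnitude scaling, and distillation. All function and variable names are hypothetical.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def expert_ffn(x, w_gate, w_up, w_down):
    """Gated FFN: (silu(x @ W_gate) * (x @ W_up)) @ W_down."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def concat_experts(experts, keep):
    """Stack the kept experts along the hidden (intermediate) dimension."""
    w_gate = np.concatenate([experts[i][0] for i in keep], axis=1)
    w_up = np.concatenate([experts[i][1] for i in keep], axis=1)
    w_down = np.concatenate([experts[i][2] for i in keep], axis=0)
    return w_gate, w_up, w_down

rng = np.random.default_rng(0)
d_model, d_expert, n_experts, k = 8, 4, 6, 3

# Toy experts: (W_gate, W_up, W_down) per expert, random for illustration.
experts = [
    (rng.normal(size=(d_model, d_expert)),
     rng.normal(size=(d_model, d_expert)),
     rng.normal(size=(d_expert, d_model)))
    for _ in range(n_experts)
]

# Placeholder score (assumption): made-up routing counts per expert.
scores = np.array([120, 15, 300, 42, 210, 7])
keep = np.argsort(scores)[::-1][:k]  # indices of the k highest-scoring experts

dense = concat_experts(experts, keep)

# The concatenated FFN computes the unweighted sum of the kept experts.
x = rng.normal(size=(5, d_model))
dense_out = expert_ffn(x, *dense)
sum_out = sum(expert_ffn(x, *experts[i]) for i in keep)
assert np.allclose(dense_out, sum_out)
```

Because the gating nonlinearity acts elementwise on the hidden dimension, the concatenated network equals the plain sum of the kept experts' outputs. The original MoE instead mixes a few experts per token with router weights, so the output magnitude changes after conversion; this is plausibly the gap that the abstract's magnitude scaling and distillation steps are meant to close.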