# DOT-MoE: Differentiable Optimal Transport for MoEfication

- Source: HuggingFace Daily Papers (trending community papers)
- Published: 2026-06-01 08:00
- AIHOT score: 66
- AIHOT link: https://aihot.virxact.com/items/cmpx1m4az0064slgybbkwzxsy
- Original paper: https://arxiv.org/abs/2606.01666

## AI Summary

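DOT-MoE proposes a new framework for converting a pre-trained dense model into a Mixture-of-Experts architecture. The method formulates the decomposition of dense layers as a differentiable optimal transport problem and uses Sinkhorn-Knopp iterations to enforce strict expert capacity constraints. It also uses Straight-Through Estimators to jointly learn the discrete neuron-to-expert assignment and the token-to-expert routing policy end-to-end. Experiments show that DOT-MoE significantly outperforms baselines such as structured pruning across multiple benchmarks, retaining 90% of the original dense model's performance while reducing active parameters by 50%.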
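For orientation, a balanced transport problem of this kind is commonly written in the entropy-regularized form below. This is the generic textbook formulation, not necessarily the paper's exact objective: the cost matrix $C$, the regularization strength $\varepsilon$, and the uniform capacity $N/E$ are assumptions introduced here for illustration, with $N$ neurons and $E$ experts.

```latex
% Generic entropy-regularized balanced OT (illustrative; symbols are assumptions)
\begin{aligned}
P^\star &= \arg\min_{P \in U} \; \langle P, C \rangle - \varepsilon H(P),
\qquad H(P) = -\sum_{i=1}^{N}\sum_{j=1}^{E} P_{ij}\,(\log P_{ij} - 1), \\
U &= \Bigl\{ P \in \mathbb{R}_{\ge 0}^{N \times E} \;:\;
      P\,\mathbf{1}_E = \mathbf{1}_N, \;\;
      P^{\top}\mathbf{1}_N = \tfrac{N}{E}\,\mathbf{1}_E \Bigr\}.
\end{aligned}

% Sinkhorn-Knopp iterations with kernel K = \exp(-C/\varepsilon),
% where \oslash denotes elementwise division:
u \leftarrow \mathbf{1}_N \oslash (K v), \qquad
v \leftarrow \tfrac{N}{E}\,\mathbf{1}_E \oslash (K^{\top} u), \qquad
P = \operatorname{diag}(u)\, K \operatorname{diag}(v).
```

The row constraint says every neuron is fully assigned, and the column constraint says every expert receives the same total mass, which is how a capacity limit can be expressed in this setting. Each update is a smooth function of $C$, so gradients can flow through a fixed number of iterations.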

## Body

Scaling Large Language Models (LLMs) has driven significant performance gains but has also created substantial challenges for inference efficiency. Mixture-of-Experts (MoE) architectures address this by decoupling model size from inference cost, but training MoEs from scratch is often unstable and compute-intensive. Converting pre-trained dense models into sparse MoEs has emerged as an alternative; however, existing methods typically rely on heuristic neuron clustering or random splitting to partition the Feed-Forward Network (FFN) into experts. In this work, we propose DOT-MoE, a novel framework that formulates the decomposition of dense layers as a Differentiable Optimal Transport (DOT) problem. Instead of relying on static heuristics, we model neuron assignment as a balanced transport problem and use differentiable Sinkhorn-Knopp iterations to enforce strict expert capacity constraints. Furthermore, we use Straight-Through Estimators (STE) to jointly learn the discrete neuron-to-expert assignment and the token-to-expert routing policy end-to-end. Extensive experiments across multiple architectures and benchmarks demonstrate that DOT-MoE significantly outperforms structured pruning, heuristic clustering, and random-split baselines, retaining 90% of the original dense model's performance while reducing active parameters by 50%.
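To make the balanced-assignment idea concrete, here is a minimal pure-Python sketch of Sinkhorn-Knopp normalization followed by a capacity-respecting rounding step. It is not the paper's implementation: the function names `sinkhorn_plan` and `round_to_capacity`, the toy `scores` matrix, the temperature `tau`, and the greedy rounding rule are all assumptions made for illustration, and the number of neurons is assumed to be divisible by the number of experts.

```python
import math


def sinkhorn_plan(scores, n_iters=200, tau=0.5):
    """Soft neuron-to-expert plan with balanced marginals (illustrative).

    scores[i][j] is a hypothetical affinity of neuron i for expert j.
    After convergence each row sums to 1 (a neuron is fully assigned) and
    each column sums to n_neurons / n_experts (experts are equally loaded).
    """
    n, m = len(scores), len(scores[0])
    capacity = n / m
    # Gibbs kernel; subtracting the global max keeps exp() from overflowing.
    top = max(max(row) for row in scores)
    plan = [[math.exp((s - top) / tau) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: rescale so every expert receives `capacity` total mass.
        col = [sum(plan[i][j] for i in range(n)) for j in range(m)]
        plan = [[plan[i][j] * capacity / col[j] for j in range(m)]
                for i in range(n)]
        # Row step: rescale so every neuron distributes exactly one unit.
        plan = [[p / sum(row) for p in row] for row in plan]
    return plan


def round_to_capacity(plan):
    """Greedy hard assignment that never exceeds an expert's capacity.

    This rounding rule is an assumption of this sketch, not taken from the
    source. Pairs are visited from largest to smallest plan value.
    """
    n, m = len(plan), len(plan[0])
    capacity = n // m  # assumes n is divisible by m
    pairs = sorted(((plan[i][j], i, j) for i in range(n) for j in range(m)),
                   reverse=True)
    assignment = [None] * n
    load = [0] * m
    for _, i, j in pairs:
        if assignment[i] is None and load[j] < capacity:
            assignment[i] = j
            load[j] += 1
    return assignment


# Toy example: four of six neurons prefer expert 0, but each expert holds 3.
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3],
          [0.6, 0.4], [0.2, 0.8], [0.1, 0.9]]
plan = sinkhorn_plan(scores)
assignment = round_to_capacity(plan)
print(assignment)
```

In an autodiff framework, a straight-through estimator is typically written as `hard + soft - stop_gradient(soft)`: the forward pass sees the one-hot `hard` assignment while the backward pass uses the gradient of the `soft` Sinkhorn plan. That step is omitted here because the sketch uses only the standard library.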
