Post-Trained MoE Can Skip Half Experts via Self-Distillation
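Latest research: post-trained MoE models can skip half of their experts via self-distillation, with no need to pretrain from scratch, significantly reducing compute.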
arXiv:2605.18643v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) scales language models efficiently through sparse expert activation, and its …