deepseek-ai / DeepSeek-MoE

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

GPU utilization is low compared with a dense model

charliedream1 opened this issue · comments

Training time is longer than for a 14B model. GPU utilization is low; it keeps dropping to 0% and then climbing back to 100%.

Currently, this code has not been extensively optimized for efficient training (e.g., fused MoE kernels or expert parallelism). When resources are sufficient, we recommend using ZeRO-1 to reduce communication overhead.
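For reference, a minimal sketch of a DeepSpeed ZeRO stage-1 config is shown below. ZeRO-1 shards only the optimizer states, so it adds less all-gather traffic than stages 2/3. The file name, batch sizes, and bucket settings here are assumptions for illustration, not the repo's official config; adjust them to your hardware and launch script.

```python
# Sketch only: write out a ZeRO-1 DeepSpeed config (values are assumptions).
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 4,   # assumed; tune for your GPUs
    "gradient_accumulation_steps": 8,      # assumed
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 1,                        # ZeRO-1: shard optimizer states only
        "reduce_bucket_size": 5e8,         # larger buckets -> fewer, bigger all-reduces
        "overlap_comm": True               # overlap gradient reduction with backward
    },
    "gradient_clipping": 1.0
}

with open("ds_zero1_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)

# Then pass it to your launcher, e.g.:
#   deepspeed train.py --deepspeed ds_zero1_config.json
```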

Thanks for the reply. Hopefully these optimizations will be added soon.

So, without those optimizations, what makes it so much slower than a 14B model under comparable settings? Only 2.7B parameters are activated, yet training is much slower than a 14B model, and even slower than a 20B model. Is it because the many experts are computed sequentially rather than in parallel?

Too many fragmented matmuls and too much network communication.
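The sketch below (not the repo's code) illustrates the "fragmented matmul" point: with top-1 routing the total FLOPs match a dense FFN of the same width, but a naive per-expert loop splits one large GEMM into many small ones, each launched as its own kernel, so the GPU idles between launches. Shapes and expert counts are made-up values for illustration.

```python
# Illustrative sketch of dense vs. naive per-expert MoE compute (assumed shapes).
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
tokens, d_model, d_ff, n_experts = 4096, 2048, 1408, 64
x = torch.randn(tokens, d_model, device=device)

# Dense FFN: one large matmul keeps the GPU saturated.
w_dense = torch.randn(d_model, d_ff, device=device)
y_dense = x @ w_dense

# Naive MoE: the router scatters tokens across experts, producing many small matmuls.
w_experts = [torch.randn(d_model, d_ff, device=device) for _ in range(n_experts)]
expert_ids = torch.randint(n_experts, (tokens,), device=device)  # stand-in for the router
y_moe = torch.empty(tokens, d_ff, device=device)
for e in range(n_experts):
    mask = expert_ids == e
    # Each iteration is a small GEMM over ~tokens/n_experts rows; without a
    # fused/grouped kernel, launch overhead and poor GEMM efficiency dominate.
    y_moe[mask] = x[mask] @ w_experts[e]
```

On multiple GPUs, the all-to-all token exchange for expert routing adds further communication on top of the usual data-parallel gradient sync, which is why utilization dips to 0% between compute bursts.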