deepseek-ai / DeepSeek-MoE

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

GPU utilization is low compared with a dense model

charliedream1 opened this issue · comments

Training time is longer than for a 14B model. GPU utilization is low; it keeps dropping to 0% and then climbing back to 100%.

Currently, this code has not been extensively optimized for efficient training (e.g., fused MoE kernels or expert parallelism). When resources are sufficient, we recommend using ZeRO-1 to reduce communication overhead.
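For reference, a minimal sketch of a DeepSpeed ZeRO stage-1 config is shown below. ZeRO-1 shards only the optimizer states, so it adds less all-gather traffic than stages 2/3. The file name, batch sizes, and bucket settings here are assumptions for illustration, not the repo's official config; adjust them to your hardware and launch script.

```python
# Sketch only: write out a ZeRO-1 DeepSpeed config (values are assumptions).
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 4,   # assumed; tune for your GPUs
    "gradient_accumulation_steps": 8,      # assumed
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 1,                        # ZeRO-1: shard optimizer states only
        "reduce_bucket_size": 5e8,         # larger buckets -> fewer, bigger all-reduces
        "overlap_comm": True               # overlap gradient reduction with backward
    },
    "gradient_clipping": 1.0
}

with open("ds_zero1_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)

# Then pass it to your launcher, e.g.:
#   deepspeed train.py --deepspeed ds_zero1_config.json
```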

Thanks for the reply. Hopefully these optimizations will be added soon.

So, without those optimizations, what makes it so much slower than a 14B model under comparable settings? Only 2.7B parameters are activated, yet training is much slower than a 14B model, and even slower than a 20B model. Is it because the many experts are computed sequentially rather than in parallel?

Too many fragmented matmuls and too much network communication.
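The sketch below (not the repo's code) illustrates the "fragmented matmul" point: with top-1 routing the total FLOPs match a dense FFN of the same width, but a naive per-expert loop splits one large GEMM into many small ones, each launched as its own kernel, so the GPU idles between launches. Shapes and expert counts are made-up values for illustration.

```python
# Illustrative sketch of dense vs. naive per-expert MoE compute (assumed shapes).
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
tokens, d_model, d_ff, n_experts = 4096, 2048, 1408, 64
x = torch.randn(tokens, d_model, device=device)

# Dense FFN: one large matmul keeps the GPU saturated.
w_dense = torch.randn(d_model, d_ff, device=device)
y_dense = x @ w_dense

# Naive MoE: the router scatters tokens across experts, producing many small matmuls.
w_experts = [torch.randn(d_model, d_ff, device=device) for _ in range(n_experts)]
expert_ids = torch.randint(n_experts, (tokens,), device=device)  # stand-in for the router
y_moe = torch.empty(tokens, d_ff, device=device)
for e in range(n_experts):
    mask = expert_ids == e
    # Each iteration is a small GEMM over ~tokens/n_experts rows; without a
    # fused/grouped kernel, launch overhead and poor GEMM efficiency dominate.
    y_moe[mask] = x[mask] @ w_experts[e]
```

On multiple GPUs, the all-to-all token exchange for expert routing adds further communication on top of the usual data-parallel gradient sync, which is why utilization dips to 0% between compute bursts.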