microsoft / tutel

Tutel MoE: An Optimized Mixture-of-Experts Implementation

Qs

zws98 opened this issue · comments

I trained an MoE model on 8 GPUs with 8 experts. When I ran inference in parallel, each process produced similar but slightly different results. Could you tell me what might cause this?

Maybe you can check whether drop-less MoE mode solves your issue; it is enabled by setting capacity_factor=0.
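
For reference, drop-less mode is selected through the gate's capacity_factor. Below is a minimal sketch, assuming the moe_layer constructor and argument names from Tutel's helloworld example; they may differ slightly between versions:

```python
import torch
from tutel import moe as tutel_moe

model_dim, hidden_size, num_local_experts = 2048, 2048, 2

# capacity_factor=0 asks the gate to grow expert capacity to fit all routed
# tokens (drop-less mode); a positive value such as 1.25 caps the per-expert
# capacity, so overflow tokens can be dropped.
moe_layer = tutel_moe.moe_layer(
    gate_type={'type': 'top', 'k': 2, 'capacity_factor': 0},
    model_dim=model_dim,
    experts={
        'type': 'ffn',
        'count_per_node': num_local_experts,
        'hidden_size_per_expert': hidden_size,
        'activation_fn': lambda x: torch.nn.functional.relu(x),
    },
)

x = torch.randn(4, 1024, model_dim)
y = moe_layer(x)
```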

With capacity_factor=0, the results still differ across processes, and they also differ from the results obtained with capacity_factor=1.25.

Do you have more information? I don't quite understand what you mean.

Outputs from different GPUs:

STEP-10: loss = 21.11541, step_time = 3.628716 sec, perf = 0.08 tflops.
[Summary] Average synchronized step_time = 0.3628715753555298 sec.
STEP-10: loss = 21.11541, step_time = 3.670310 sec, perf = 0.07 tflops.
[Summary] Average synchronized step_time = 0.36703104972839357 sec.
STEP-10: loss = 21.11541, step_time = 3.689584 sec, perf = 0.07 tflops.
[Summary] Average synchronized step_time = 0.3689584493637085 sec.
STEP-10: loss = 21.11541, step_time = 3.675405 sec, perf = 0.07 tflops.
[Summary] Average synchronized step_time = 0.36754045486450193 sec.
STEP-10: loss = 21.11541, step_time = 3.681213 sec, perf = 0.07 tflops.
[Summary] Average synchronized step_time = 0.36812126636505127 sec.
STEP-10: loss = 21.11541, step_time = 3.629702 sec, perf = 0.08 tflops.
[Summary] Average synchronized step_time = 0.3629701852798462 sec.
STEP-10: loss = 21.11541, step_time = 3.700365 sec, perf = 0.07 tflops.
[Summary] Average synchronized step_time = 0.37003653049468993 sec.
STEP-10: loss = 21.11541, step_time = 3.658189 sec, perf = 0.08 tflops.
[Summary] Average synchronized step_time = 0.3658188819885254 sec.