support alternative parallelism
152334H opened this issue · comments
152334H commented
--num-gpus
is implemented by sharding each expert layer across GPUs, i.e. expert parallelism (EP).
This is probably not advisable for local experimentation, especially at batch size 1, where EP only adds communication overhead with no speed benefit over naive model/pipeline parallelism.
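To make the overhead argument concrete, here is a rough back-of-the-envelope sketch (not from this repo, just an illustration): with EP at batch size 1, every MoE layer may have to ship the token's hidden state to its top-k experts on other GPUs and gather the results back, while pipeline parallelism only moves activations at the stage boundaries between GPUs. The layer/top-k numbers below are hypothetical worst-case assumptions, not measurements.

```python
def ep_transfers(num_moe_layers: int, top_k: int) -> int:
    """Worst-case cross-GPU transfers per token under expert parallelism.

    Assumes every selected expert lives on a remote GPU, so each MoE
    layer costs a dispatch + a combine per expert (2 * top_k transfers).
    """
    return num_moe_layers * top_k * 2

def pp_transfers(num_gpus: int) -> int:
    """Cross-GPU transfers per token under naive pipeline parallelism.

    Activations cross a device boundary only at each of the
    (num_gpus - 1) stage cuts, regardless of model depth.
    """
    return num_gpus - 1

# Hypothetical Mixtral-like config: 32 MoE layers, top-2 routing, 2 GPUs.
print(ep_transfers(32, 2))  # 128 transfers per token (worst case)
print(pp_transfers(2))      # 1 transfer per token
```

At batch size 1 there is also no throughput to recover from the extra parallelism, so the EP communication is pure overhead relative to simply splitting layers across devices.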
Songyang Zhang commented
Good suggestion. I am working on other parallelism methods. Contributions are also welcome.