huggingface / nanotron

Minimalistic large language model 3D-parallelism training

[Feature] Fix support for sequence parallelism with MoEs

NouamaneTazi opened this issue

Our current MoE implementation only works with tp_mode="ALL_REDUCE". We should fix the implementation for tp_mode="REDUCE_SCATTER" so that MoEs support sequence parallelism.
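For context, here is a rough sketch of how the two modes differ in what each tensor-parallel rank holds after the reduction. It is a pure-Python simulation, not nanotron code: the helper names (`all_reduce`, `reduce_scatter`, `all_gather`) and the toy values are assumptions made for illustration, with plain lists standing in for tensors and a list of per-rank values standing in for the TP group.

```python
# Illustrative simulation only (hypothetical helpers, not nanotron's API).
# partials[r] is rank r's partial output, one value per token position.

def all_reduce(partials):
    """Every rank receives the full element-wise sum over all ranks."""
    total = [sum(vals) for vals in zip(*partials)]
    return [list(total) for _ in partials]

def reduce_scatter(partials):
    """Each rank receives only its contiguous slice of the summed output."""
    world_size = len(partials)
    total = [sum(vals) for vals in zip(*partials)]
    assert len(total) % world_size == 0, "sequence length must divide evenly"
    chunk = len(total) // world_size
    return [total[r * chunk:(r + 1) * chunk] for r in range(world_size)]

def all_gather(shards):
    """Every rank receives the concatenation of all ranks' shards."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]

# Two ranks, four token positions.
partials = [[1, 2, 3, 4], [10, 20, 30, 40]]

full_per_rank = all_reduce(partials)       # each rank sees all 4 tokens
shard_per_rank = reduce_scatter(partials)  # each rank sees 2 of the 4 tokens
regathered = all_gather(shard_per_rank)    # recovers the all-reduce result
```

With the all-reduce collective, every rank ends up with activations for the whole sequence. With reduce-scatter, each rank keeps only its slice along the sequence dimension, so any layer that assumes it sees every token (for example, a router that dispatches tokens to experts) would presumably need an extra gather or a shard-aware code path.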