A PyTorch implementation of Sparsely-Gated Mixture of Experts, for massively increasing the parameter count of language models
VRCMF opened this issue 3 years ago · comments
There is an error in the code that causes the backward pass to fail when computing gradients.
Solution: density_1_proxy = density_1_proxy * equals_one_mask[..., None]
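For anyone hitting something similar, here is a minimal sketch of why the out-of-place form can matter. It assumes the failing line was an in-place multiply (`*=`) on a tensor that aliases a softmax output, which the thread does not actually show. The function name `backward_ok` and the tensor shapes are made up for illustration and are not taken from the repository.

```python
import torch

def backward_ok(in_place: bool) -> bool:
    """Return True if backward() succeeds, False if autograd raises."""
    # Hypothetical stand-ins for the gating logits and the importance mask.
    logits = torch.randn(4, 3, requires_grad=True)
    raw_gates = logits.softmax(dim=-1)  # softmax saves its output for backward
    equals_one_mask = torch.tensor([1.0, 0.0, 1.0, 1.0])

    density_1_proxy = raw_gates  # alias, not a copy
    if in_place:
        # Assumed original form: overwrites raw_gates, which softmax
        # still needs in order to compute its gradient.
        density_1_proxy *= equals_one_mask[..., None]
    else:
        # Form from the proposed solution: allocates a new tensor and
        # leaves raw_gates untouched.
        density_1_proxy = density_1_proxy * equals_one_mask[..., None]

    try:
        density_1_proxy.sum().backward()
        return True
    except RuntimeError:
        # Autograd detects that a saved tensor was modified in place.
        return False

if __name__ == "__main__":
    print("out-of-place:", backward_ok(in_place=False))
    print("in-place:", backward_ok(in_place=True))
```

In this sketch the in-place variant fails only when `backward()` runs, not at the multiply itself, so the error can surface far from the line that caused it.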
@VRCMF thanks Wei! 04201ee