XueFuzhao / OpenMoE

A family of open-sourced Mixture-of-Experts (MoE) Large Language Models

Some questions I want to ask

YixinSong-e opened this issue

I am very excited to see your work on MoE. I would like to ask: your 8B MoE model currently has 32 experts, so how many experts are activated at a time?

Can't wait to see OpenMoE-8B trained on 1T tokens. :)

Thank you so much for your interest! 2 of the 32 experts are activated in each MoE layer. See this gin file and its gin-file dependencies for the details:
https://github.com/XueFuzhao/flaxformer/blob/main/flaxformer/t5x/configs/moe/models/tokens_choose_decoder_only_large.gin
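
For readers unfamiliar with token-choice top-k routing, here is a minimal PyTorch-style sketch of how a top-2 router picks 2 of 32 experts per token. The dimensions are illustrative only; the actual OpenMoE implementation is the JAX/flaxformer code linked above.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes only; not the actual OpenMoE-8B dimensions.
num_experts, top_k, d_model = 32, 2, 1024
tokens = torch.randn(16, d_model)                  # a batch of 16 token embeddings

# The router is a single linear projection from d_model to num_experts.
router = torch.nn.Linear(d_model, num_experts)

logits = router(tokens)                            # (16, 32) routing logits
probs = F.softmax(logits, dim=-1)                  # routing probabilities
gate_vals, expert_ids = probs.topk(top_k, dim=-1)  # top-2 experts per token

# Each token is sent to its 2 chosen experts; the expert outputs are later
# combined, weighted by the (renormalized) gate values.
gate_vals = gate_vals / gate_vals.sum(dim=-1, keepdim=True)
print(expert_ids[0], gate_vals[0])                 # which experts token 0 uses
```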

The 1T-token model is coming soon. An efficient PyTorch implementation of OpenMoE is also on the way; you may see it within around one month.

Sounds nice! Actually, I have a question.
Have you tried refining the granularity of the experts? For example, setting up 64 experts and activating 8 of them, or 128 experts and activating 9. That way, each expert becomes smaller.
If the experts have finer granularity, I think there may be more opportunities for optimization at inference time.

I personally don't think using a lot of very small experts is an efficient way to deploy MoE. The routing itself has computational overhead, so we would spend a higher ratio of FLOPs on routing instead of the useful FFN compute. Based on my engineering experience, this is not a good trade-off for a pre-training backbone.

In addition, some recent works focus on LoRA-based MoE at the fine-tuning stage. That is another line of research, more about instruction fine-tuning, which is very different from MoE pre-training.

Thank you for your reply!
Actually, the router is just a small MLP; I don't think it introduces much compute. And if we have finer-grained, smaller experts, I think there is more room for optimization in the serving system.

Thanks for the insight! This is a very interesting and useful question.

I respectfully disagree with the claim that "the router is just a small MLP, so it doesn't introduce much compute."

First of all, YES, the router only introduces a small amount of FLOPs. However, a small amount of FLOPs does not mean a small cost or latency. Advanced GPUs and TPUs are optimized for large, parallel matrix computations, so a small matrix multiplication is usually not much faster than a larger one. For instance, based on my experience, an nn.Linear(16, 16) is usually not much faster than an nn.Linear(256, 256) when run end-to-end, even though the latter has 256 times the FLOPs. Therefore, I don't think using small experts is a more computation-efficient design.
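
If you want to sanity-check this on your own hardware, a rough micro-benchmark like the sketch below (illustrative only; results depend heavily on device, batch size, and launch overhead) usually shows that the small layer is nowhere near 256x faster:

```python
import time
import torch

def bench(layer, x, iters=1000):
    # Warm up, then time repeated forward passes.
    for _ in range(10):
        layer(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        layer(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

device = "cuda" if torch.cuda.is_available() else "cpu"
small = torch.nn.Linear(16, 16).to(device)
large = torch.nn.Linear(256, 256).to(device)

t_small = bench(small, torch.randn(32, 16, device=device))
t_large = bench(large, torch.randn(32, 256, device=device))
# On accelerators both runs are usually dominated by kernel-launch overhead,
# so the 256x FLOPs gap translates to far less than a 256x latency gap.
print(f"Linear(16,16): {t_small*1e6:.1f} us   Linear(256,256): {t_large*1e6:.1f} us")
```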

Also, the routing operation requires a lot of one-hot vectors and other complicated kernel calls, which are not hardware friendly and lead to low hardware utilization.
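
As a rough illustration of the bookkeeping involved, here is a simplified sketch of the token-choice dispatch pattern with hypothetical sizes (not the actual flaxformer kernels):

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes for illustration.
num_tokens, num_experts, capacity = 8, 4, 4
expert_ids = torch.randint(0, num_experts, (num_tokens,))  # top-1 choice per token

# One-hot dispatch mask: (num_tokens, num_experts)
dispatch = F.one_hot(expert_ids, num_experts).float()

# Position of each token within its expert's buffer (running count per expert),
# used to enforce the capacity limit; tokens beyond capacity are dropped.
position_in_expert = (torch.cumsum(dispatch, dim=0) - 1.0) * dispatch
keep = (position_in_expert < capacity).float() * dispatch

# These small one-hot / cumsum / masking ops launch many tiny kernels and
# move little data per kernel, which is why hardware utilization is low.
print(keep.sum(dim=0))   # how many tokens each expert actually receives
```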

Last, more experts introduce more communication requests. The all-to-all communication of MoE is very expensive and slow. If the experts are very small, we spend a lot on communication but only a little time on computation, and as we know, computation is the key ingredient of scaling and pre-training. This would make training less efficient and ultimately result in a worse model in terms of the cost-effectiveness trade-off.
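
For context, in expert parallelism each rank hosts a subset of the experts, and every MoE layer performs one all-to-all exchange to dispatch tokens and another to combine the results. A minimal torch.distributed sketch of the dispatch step (assuming the process group is already initialized and tokens are pre-grouped by destination rank; names here are hypothetical) looks like this:

```python
import torch
import torch.distributed as dist

def moe_all_to_all_dispatch(tokens_per_rank: torch.Tensor) -> torch.Tensor:
    """Exchange routed tokens across expert-parallel ranks.

    tokens_per_rank: (world_size, capacity, d_model) - the tokens this rank
    wants to send to each other rank's experts, already grouped by destination.
    Assumes dist.init_process_group(...) has been called elsewhere.
    """
    output = torch.empty_like(tokens_per_rank)
    # Every MoE layer pays for this exchange twice (dispatch + combine),
    # regardless of how small the per-expert compute is.
    dist.all_to_all_single(output, tokens_per_rank)
    return output
```

Shrinking the experts shrinks the per-expert matmul but not this exchange, so the compute-to-communication ratio gets worse.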

So in summary, many small experts can work, but probably only at the fine-tuning stage, as with LoRA-MoE. For a pre-training MoE backbone, we'd better spend more computation on the FFN and attention layers.

Thank you! These are just some of my curiosities. Looking forward to the model! :)
I just thought that with more fine-grained experts, I might have a way to reduce deployment costs, and I want to do some proof-of-concept work to check my solution on a decoder-only MoE model.