mistralai / mistral-inference

Official inference library for Mistral models

Home Page: https://mistral.ai/

Gate is Linear Layer?!?!

Eran-BA opened this issue · comments

I have two fundamental questions regarding your code in the repository: https://github.com/mistralai/mistral-src/tree/main/mistral/model.py

  1. You implemented the gate as a plain linear layer, which doesn't make sense to me. To decide which expert should process each token, shouldn't the gate itself be some kind of transformer (a Switch Transformer, maybe?) rather than a linear layer?

  2. You don't use GPTs (full transformer models) as experts, but just regular linear layers.

Where is the full code?

Hi! I have nothing to do with Mistral, but I can answer your questions.

Gates, or routers, are always linear layers, even in the Switch Transformer.
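
To make that concrete, here is a minimal sketch of such a router, assuming PyTorch; the hidden size, number of experts, and top-k value are illustrative, not Mistral's actual configuration. The whole "gate" is one linear projection from a token's hidden state to one logit per expert, followed by a top-k pick:

```python
# Minimal router sketch (illustrative sizes, not the real config).
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim, n_experts, top_k = 4096, 8, 2

# The entire gate: a single linear projection, no attention involved.
gate = nn.Linear(hidden_dim, n_experts, bias=False)

x = torch.randn(5, hidden_dim)            # 5 token hidden states
logits = gate(x)                          # (5, n_experts) routing logits
weights, expert_ids = torch.topk(logits, top_k, dim=-1)
weights = F.softmax(weights, dim=-1)      # mixing weights over the chosen experts
print(expert_ids)                         # which top_k experts each token is sent to
```

Implementations differ only in small details, for example whether the softmax is applied before or after the top-k selection.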

The experts are nearly always regular linear layers, i.e. small MLPs. Sometimes experts also include attention layers, or there are separate experts for attention, but usually it is just MLP experts with shared attention.
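
Here is a hedged sketch of how those pieces fit together: each expert is a plain SwiGLU-style MLP (the same shape as the Mistral/Mixtral feed-forward block), and the linear gate above decides which experts a token visits; attention stays shared and lives outside this block. Class names and dimensions below are my own, for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpertMLP(nn.Module):
    """One expert: a SwiGLU-style feed-forward network, nothing more."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)   # gate projection
        self.w2 = nn.Linear(hidden, dim, bias=False)   # down projection
        self.w3 = nn.Linear(dim, hidden, bias=False)   # up projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class MoEFeedForward(nn.Module):
    """Sparse MoE feed-forward block: linear gate + a set of MLP experts."""
    def __init__(self, dim=512, hidden=1024, n_experts=8, top_k=2):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(ExpertMLP(dim, hidden) for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x):                              # x: (tokens, dim)
        logits = self.gate(x)
        weights, ids = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):                    # naive dispatch, for clarity only
            for e, expert in enumerate(self.experts):
                mask = ids[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

y = MoEFeedForward()(torch.randn(5, 512))
print(y.shape)                                          # torch.Size([5, 512])
```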

There is an implementation of Mixtral (and regular Mistral) by Hugging Face:
https://github.com/huggingface/transformers/blob/v4.38.2/src/transformers/models/mixtral/modeling_mixtral.py
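If you just want to see the structure without downloading any weights, you can instantiate a tiny, randomly initialized Mixtral from that implementation and print its MoE block. The config values below are arbitrary small numbers I picked for inspection, not a real checkpoint.

```python
from transformers import MixtralConfig, MixtralForCausalLM

# Tiny random-weight Mixtral, just to inspect the module layout.
cfg = MixtralConfig(
    hidden_size=64, intermediate_size=128,
    num_hidden_layers=2, num_attention_heads=4, num_key_value_heads=2,
    num_local_experts=8, num_experts_per_tok=2,
)
model = MixtralForCausalLM(cfg)

moe = model.model.layers[0].block_sparse_moe
print(moe.gate)        # Linear(in_features=64, out_features=8, bias=False) -> the router
print(moe.experts[0])  # an MLP expert with w1/w2/w3 projections, no attention
```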
Hope this helps