mistralai / mistral-inference

Official inference library for Mistral models

Home Page: https://mistral.ai/

Gate is Linear Layer?!?!

Eran-BA opened this issue · comments

I have two fundamental questions regarding your code in the repository: https://github.com/mistralai/mistral-src/tree/main/mistral/model.py

  1. You implemented the gate as a plain linear layer, which doesn't make sense to me. To decide which expert should process each token, shouldn't the gate itself be some kind of transformer (a Switch Transformer, maybe?) rather than a linear layer?

  2. You don't use GPTs (full transformer models) as experts, but just regular linear layers.

Where is the full code?

Hi! I have nothing to do with Mistral, but I can answer your questions.

Gates, or routers, are always linear layers, even in the Switch Transformer.
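
To make that concrete, here is a minimal sketch of such a router, assuming PyTorch; the hidden size, number of experts, and top-k value are illustrative, not Mistral's actual configuration. The whole "gate" is one linear projection from a token's hidden state to one logit per expert, followed by a top-k pick:

```python
# Minimal router sketch (illustrative sizes, not the real config).
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim, n_experts, top_k = 4096, 8, 2

# The entire gate: a single linear projection, no attention involved.
gate = nn.Linear(hidden_dim, n_experts, bias=False)

x = torch.randn(5, hidden_dim)            # 5 token hidden states
logits = gate(x)                          # (5, n_experts) routing logits
weights, expert_ids = torch.topk(logits, top_k, dim=-1)
weights = F.softmax(weights, dim=-1)      # mixing weights over the chosen experts
print(expert_ids)                         # which top_k experts each token is sent to
```

Implementations differ only in small details, for example whether the softmax is applied before or after the top-k selection.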

The experts are nearly always regular linear layers, i.e. small MLPs. Sometimes experts also include attention layers, or there are separate experts for attention, but usually it is just MLP experts with shared attention.
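
Here is a hedged sketch of how those pieces fit together: each expert is a plain SwiGLU-style MLP (the same shape as the Mistral/Mixtral feed-forward block), and the linear gate above decides which experts a token visits; attention stays shared and lives outside this block. Class names and dimensions below are my own, for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpertMLP(nn.Module):
    """One expert: a SwiGLU-style feed-forward network, nothing more."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)   # gate projection
        self.w2 = nn.Linear(hidden, dim, bias=False)   # down projection
        self.w3 = nn.Linear(dim, hidden, bias=False)   # up projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class MoEFeedForward(nn.Module):
    """Sparse MoE feed-forward block: linear gate + a set of MLP experts."""
    def __init__(self, dim=512, hidden=1024, n_experts=8, top_k=2):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(ExpertMLP(dim, hidden) for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x):                              # x: (tokens, dim)
        logits = self.gate(x)
        weights, ids = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):                    # naive dispatch, for clarity only
            for e, expert in enumerate(self.experts):
                mask = ids[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

y = MoEFeedForward()(torch.randn(5, 512))
print(y.shape)                                          # torch.Size([5, 512])
```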

There is an implementation of Mixtral (and regular Mistral) by Hugging Face:
https://github.com/huggingface/transformers/blob/v4.38.2/src/transformers/models/mixtral/modeling_mixtral.py
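If you just want to see the structure without downloading any weights, you can instantiate a tiny, randomly initialized Mixtral from that implementation and print its MoE block. The config values below are arbitrary small numbers I picked for inspection, not a real checkpoint.

```python
from transformers import MixtralConfig, MixtralForCausalLM

# Tiny random-weight Mixtral, just to inspect the module layout.
cfg = MixtralConfig(
    hidden_size=64, intermediate_size=128,
    num_hidden_layers=2, num_attention_heads=4, num_key_value_heads=2,
    num_local_experts=8, num_experts_per_tok=2,
)
model = MixtralForCausalLM(cfg)

moe = model.model.layers[0].block_sparse_moe
print(moe.gate)        # Linear(in_features=64, out_features=8, bias=False) -> the router
print(moe.experts[0])  # an MLP expert with w1/w2/w3 projections, no attention
```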
Hope this helps