microsoft / mup

maximal update parametrization (µP)

Home Page: https://arxiv.org/abs/2203.03466


Positional Embeddings should be MuReadout parameters?

codedecde opened this issue · comments

Duplicate of a question asked on the mutransformers repository (link)

Hi!
I was wondering whether (learned) positional embeddings should be MuReadout layers, since they map to a finite-dimensional space. Specifically:

https://github.com/microsoft/mutransformers/blob/480287ce7b18a07a3432e8f2fbc0f0e5b71e2599/mutransformers/models/bert/modeling_bert.py#L174

self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
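For context, here is a minimal, hypothetical sketch (following the usage shown in the mup README, not the mutransformers code itself) of the two kinds of layers this question is about: embedding-style layers index a finite set of tokens/positions into the width-scaled hidden dimension, while the output head maps the width-scaled dimension back to a fixed one and is the layer that mup replaces with MuReadout. The names (`TinyModel`, `width`, etc.) are made up for illustration.

```python
import torch.nn as nn
from mup import MuReadout, set_base_shapes

class TinyModel(nn.Module):
    """Hypothetical toy model; only `width` scales, vocab/positions stay fixed."""
    def __init__(self, width, vocab_size=1000, max_positions=512):
        super().__init__()
        # Input-side embeddings: finite index set -> width-scaled hidden dim.
        self.token_embeddings = nn.Embedding(vocab_size, width)
        self.position_embeddings = nn.Embedding(max_positions, width)  # the layer in question
        self.body = nn.Linear(width, width)
        # Output head: width-scaled hidden dim -> fixed vocab size; this is the
        # layer mup swaps in MuReadout for.
        self.readout = MuReadout(width, vocab_size)

    def forward(self, input_ids, position_ids):
        h = self.token_embeddings(input_ids) + self.position_embeddings(position_ids)
        return self.readout(self.body(h).relu())

# Base shapes come from narrower models of the same architecture (mup README usage).
base = TinyModel(width=64)
delta = TinyModel(width=128)  # used to infer which dimensions scale with width
model = TinyModel(width=256)
set_base_shapes(model, base, delta=delta)
```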

In addition to that, did you try using µP for sparse MoE models? I am curious about any findings there. Specifically, I was wondering whether the routing gate (hdim, num_experts) should also be a MuReadout layer (assuming we don't scale the number of experts); a sketch of the two alternatives follows below.
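To make that concrete, here is a hypothetical sketch of the two alternatives being asked about (the names `hidden_size`, `num_experts`, `gate_*` are illustrative, and this only states the question, not a recommended answer): the gate maps a width-scaled input to a fixed number of experts.

```python
import torch.nn as nn
from mup import MuReadout

hidden_size, num_experts = 1024, 8  # illustrative values; num_experts stays fixed as width grows

# Alternative 1: treat the gate as an ordinary linear (matrix-like) layer.
gate_plain = nn.Linear(hidden_size, num_experts)

# Alternative 2: treat the gate like the output head, since it maps a
# width-scaled dimension to a fixed one (as MuReadout does for the vocab logits).
gate_readout = MuReadout(hidden_size, num_experts)
```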

Would be grateful for any advice :)

Thank you!

Thank you!
I meant that the sequence-length dimension is finite (similar to the vocab size)?