Model and training code for the AR variant
MikeWangWZHL opened this issue
To keep this repo clean, we don't plan to release the AR code in this repo. However, it is straightforward to reimplement with the current repo -- almost all hyper-parameters stay the same as MAR. The only differences are the causal attention mask and the teacher-forcing loss.
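For concreteness, here is a minimal PyTorch sketch of those two changes. The helper names and the `(B, L, D)` token layout are illustrative, not this repo's actual code, and the per-token prediction loss itself stays whatever MAR already uses:

```python
import torch

def causal_attn_mask(seq_len: int, device=None) -> torch.Tensor:
    # Boolean mask where True marks blocked positions: token i may
    # attend only to tokens 0..i (standard causal masking).
    return torch.triu(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=device),
        diagonal=1,
    )

def teacher_forcing_shift(tokens: torch.Tensor, start_token: torch.Tensor):
    # tokens: (B, L, D) ground-truth token sequence.
    # start_token: (1, 1, D) learnable start-of-sequence embedding (hypothetical).
    # The decoder input is the sequence shifted right by one position with the
    # start embedding prepended, so position i is trained to predict token i
    # conditioned only on tokens 0..i-1.
    bsz = tokens.size(0)
    start = start_token.expand(bsz, 1, -1)
    inputs = torch.cat([start, tokens[:, :-1]], dim=1)
    targets = tokens
    return inputs, targets
```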
Hi @LTH14, in the AR variant, is it necessary for the attention mechanism within the MAE encoder to be causal? Alternatively, should we consider removing the MAE encoder altogether in this variant?
In the AR variant, we don't need the MAE encoder. A single causal decoder is enough (similar to GPT).
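For illustration, a bare-bones causal decoder along these lines might look as follows. This is a sketch assuming continuous token embeddings of shape `(B, L, D)`; the width, depth, and head count are placeholders, not the repo's configuration:

```python
import torch
import torch.nn as nn

class CausalDecoder(nn.Module):
    # A plain GPT-style stack: transformer blocks with causal self-attention
    # and no separate encoder.
    def __init__(self, dim: int = 1024, depth: int = 32, num_heads: int = 16):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, L, D); the upper-triangular boolean mask blocks
        # attention to future positions.
        L = x.size(1)
        mask = torch.triu(
            torch.ones(L, L, dtype=torch.bool, device=x.device), diagonal=1
        )
        return self.blocks(x, mask=mask)
```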
Thanks! Do you double the depth of the MAE decoder?
Yes, we keep the total number of parameters unchanged.
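In other words (the depths below are hypothetical, not the repo's actual configs): if a MAR model pairs an N-block encoder with an N-block decoder, the matching AR variant would use a single 2N-block causal decoder of the same width:

```python
# Illustrative parameter matching; depths and width are assumptions.
mar = dict(encoder_depth=16, decoder_depth=16, width=1024)  # encoder-decoder MAR
ar  = dict(encoder_depth=0,  decoder_depth=32, width=1024)  # causal-only AR variant
assert mar["encoder_depth"] + mar["decoder_depth"] == ar["decoder_depth"]
```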