LTH14 / mar

PyTorch implementation of MAR+DiffLoss https://arxiv.org/abs/2406.11838


model and training code for the AR variant

MikeWangWZHL opened this issue · comments

Thanks for open-sourcing this amazing project!
I wonder if it would be possible to also release the model and training code for the AR baseline.

Thank you in advance!

To keep this repo clean, we don't have a plan to release the AR code in this repo. However, it is very easy to reimplement it using the current repo -- almost all hyper-parameters remain the same as MAR. The only difference is the causal attention mask and the teacher-forcing loss.
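To make the two differences concrete, here is a minimal sketch (illustrative helper names, not code from this repo) of a causal attention mask and the teacher-forcing input/target shift used in next-token training:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Boolean mask, True where attention is *blocked*: position i may
    # attend only to positions j <= i (the standard autoregressive mask).
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

def teacher_forcing_pairs(tokens: torch.Tensor):
    # Teacher forcing for next-token prediction: the model is fed the
    # ground-truth sequence up to position i and trained to predict the
    # token at position i + 1.
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    return inputs, targets
```

In MAR's case the "tokens" are continuous latents and the per-position prediction is supervised by the diffusion loss rather than a cross-entropy over a vocabulary, but the mask and the shift are the same idea.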

Hi @LTH14, in the AR variant, is it necessary for the attention mechanism within the MAE encoder to be causal? Alternatively, should we consider removing the MAE encoder altogether in this variant?

In the AR variant, we don't need the MAE encoder. A single causal decoder is enough (similar to GPT).
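For illustration, such a GPT-style single causal decoder could be sketched as follows (class name and hyperparameters are assumptions, not the repo's actual configuration):

```python
import torch
import torch.nn as nn

class CausalDecoder(nn.Module):
    """Minimal GPT-style decoder-only stack (illustrative, not MAR's code)."""

    def __init__(self, dim: int = 64, depth: int = 4, heads: int = 4, seq_len: int = 16):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            dropout=0.0, batch_first=True, norm_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        # Additive float mask: -inf above the diagonal blocks future positions.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); output at position i depends only on x[:, :i+1].
        return self.blocks(x, mask=self.mask)
```

A quick sanity check of causality: perturbing the last input token should leave the outputs at all earlier positions unchanged.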

Thanks! Do you double the depth of the MAE decoder to compensate?

Yes, we keep the total number of parameters unchanged.
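As a back-of-the-envelope check (the width and depths below are assumptions for illustration, not MAR's exact configuration), removing the encoder while doubling the decoder depth keeps the total block count, and hence roughly the parameter count, unchanged:

```python
def block_params(width: int) -> int:
    # Rough per-block count for a standard transformer block:
    # ~4*d^2 for the attention projections + ~8*d^2 for a 4x MLP = ~12*d^2.
    return 12 * width * width

width = 1024                 # illustrative model width
enc_depth = dec_depth = 16   # illustrative encoder/decoder depths

mar_blocks = enc_depth + dec_depth  # MAR: MAE encoder + decoder
ar_blocks = 2 * dec_depth           # AR: encoder removed, decoder depth doubled
assert mar_blocks * block_params(width) == ar_blocks * block_params(width)
```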