Roberta to Longformer positional embeddings copy convention
dysby opened this issue
Hi,
I noticed you do not follow the Roberta-to-Longformer convention of "initialize the additional position embeddings by copying the embeddings of the first 512 positions".
You copy the [0:512] positions over and over, and then append [512:514]:
roberta2longformer/roberta2bigbird.py
Lines 52 to 57 in c9a8642
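Roughly, the behaviour described above amounts to something like the following. This is a paraphrase for illustration, not the actual lines from the linked file, and roberta_model / new_max_pos are illustrative names:

# Paraphrase of the current approach: repeat the full [0:512] block, then append rows [512:514].
import torch

src = roberta_model.embeddings.position_embeddings.weight     # shape (514, hidden)
n_blocks = (new_max_pos - 2) // 512                            # number of full copies
new_pos_embed = torch.cat([src[:512]] * n_blocks + [src[512:514]], dim=0)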
But the original Roberta-to-Longformer code does it differently, because positions [0, 1] are reserved.
It copies [2:514] over and over, leaving [0, 1] uninitialized or identical to the original (depending on the transformers/torch versions):
# copy position embeddings over and over to initialize the new position embeddings
k = 2
step = current_max_pos - 2
while k < max_pos - 1:
    new_pos_embed[k:(k + step)] = model.roberta.embeddings.position_embeddings.weight[2:]
    k += step
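For reference, in the original Longformer conversion notebook this loop sits in roughly the following context (reconstructed from memory, so the surrounding names may differ slightly; max_pos is the desired new length before adding the two reserved slots):

current_max_pos, embed_size = model.roberta.embeddings.position_embeddings.weight.shape
max_pos += 2  # Roberta reserves positions 0 and 1, so two extra rows are needed
new_pos_embed = model.roberta.embeddings.position_embeddings.weight.new_empty(max_pos, embed_size)
# ... the copy loop above runs here ...
model.roberta.embeddings.position_embeddings.weight.data = new_pos_embed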
Did you check the model performance in this setup?
Thanks,
Hi, no I wasn't aware of that and thought that the last two position embeddings were the "spare" ones, so I never checked the correct approach.
Since I did additional MLM training after the weight transfer, I would say that, in those cases, it isn't much of an issue, because the position embeddings are refined further anyway.
Yes, I think this does impact the base model performance; at least in my setup it did. Maybe in scenarios where additional training is done the degradation would be undone.
I also checked that the Longformer implementation in transformers is derived from Roberta, so it's important to reserve the first two position embeddings: every sequence's position ids are shifted (they start at padding_idx + 1 = 2).
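A quick way to see the shift, using the helper that transformers provides for Roberta-style models (the token ids are just an example):

import torch
from transformers.models.roberta.modeling_roberta import create_position_ids_from_input_ids

# Roberta's pad token id is 1, so real tokens get positions starting at 2.
input_ids = torch.tensor([[100, 200, 300, 1, 1]])   # last two tokens are padding
position_ids = create_position_ids_from_input_ids(input_ids, padding_idx=1)
print(position_ids)   # tensor([[2, 3, 4, 1, 1]])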
For the BigBird and Nystromformer implementations in transformers, the input positions are not shifted, so their position embeddings should be filled starting from index 0. But one should keep in mind that a Roberta-to-BigBird or Roberta-to-Nystromformer conversion should still only copy the original roberta position_embeddings from index 2 onwards.
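A minimal sketch of that copy, assuming src_model is the Roberta checkpoint and tgt_model is the BigBird/Nystromformer model (the variable names are illustrative, not from the repo):

import torch

# Source: Roberta positions 0 and 1 are reserved, so only rows [2:] carry real content.
# Target: BigBird/Nystromformer positions start at 0, so fill the table from index 0.
src = src_model.embeddings.position_embeddings.weight    # shape (514, hidden)
tgt = tgt_model.embeddings.position_embeddings.weight    # shape (target_max_pos, hidden)

with torch.no_grad():
    usable = src[2:]                                      # drop the two reserved rows
    k = 0
    while k < tgt.shape[0]:
        chunk = usable[: tgt.shape[0] - k]                # truncate the final copy if needed
        tgt[k:k + chunk.shape[0]] = chunk
        k += chunk.shape[0]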
Thanks for sharing your code. I'm keeping my changes in my own fork and would happily share them.