LennartKeller / roberta2longformer

Convert pretrained RoBERTa models to various long-document transformer models

RoBERTa to Longformer positional embeddings copy convention

dysby opened this issue

commented

Hi,
I noticed you do not follow the RoBERTa-to-Longformer convention: "initialize the additional position embeddings by copying the embeddings of the first 512 positions".

You copy positions [0:512] over and over, and then append positions [512:514] at the end:

roberta_pos_embs = roberta_model.base_model.embeddings.state_dict()[
    "position_embeddings.weight"
][:-2]
roberta_pos_embs_extra = roberta_model.base_model.embeddings.state_dict()[
    "position_embeddings.weight"
][-2:]

But the original RoBERTa-to-Longformer conversion code does it differently, because positions [0, 1] are reserved.

It copies [2:514] over and over, leaving positions [0, 1] either uninitialized or identical to the original ones (depending on the transformers/torch version):

# copy position embeddings over and over to initialize the new position embeddings
k = 2
step = current_max_pos - 2
while k < max_pos - 1:
    new_pos_embed[k:(k + step)] = model.roberta.embeddings.position_embeddings.weight[2:]
    k += step
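
For context, here is a minimal end-to-end sketch of that resizing step, assuming a transformers RobertaModel and a target length of 4096 plus the two reserved positions; the variable names are illustrative, and handling of the registered position_ids buffer differs between transformers versions:

import torch
from transformers import RobertaModel

model = RobertaModel.from_pretrained("roberta-base")
old_pos_embs = model.embeddings.position_embeddings.weight  # shape (514, hidden_size)
current_max_pos, embed_dim = old_pos_embs.shape
max_pos = 4096 + 2  # keep the two reserved positions at the front

with torch.no_grad():
    new_pos_embed = old_pos_embs.new_empty(max_pos, embed_dim)
    # positions 0 and 1 are reserved: keep them exactly as in the original model
    new_pos_embed[:2] = old_pos_embs[:2]
    # tile the 512 "real" embeddings [2:514] until the new matrix is full
    k = 2
    step = current_max_pos - 2
    while k < max_pos:
        n = min(step, max_pos - k)
        new_pos_embed[k:k + n] = old_pos_embs[2:2 + n]
        k += n
    model.embeddings.position_embeddings.weight.data = new_pos_embed
# model.config.max_position_embeddings and, depending on the transformers version,
# the registered position_ids buffer have to be updated to max_pos as well.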

Did you check the model performance in this setup?

Thanks,

Hi, no, I wasn't aware of that and thought that the last two position embeddings were the "spare" ones, so I never checked which way is correct.

Since I did additional MLM training after the weight transfer, I would say that in those cases it isn't that much of an issue, because the position embeddings are refined further anyway.

commented

Yes, I think this does impact the base performance of the converted model; at least in my setup it did. Maybe in scenarios where additional training is done the effect would be reverted.

I also checked that the Longformer implementation in transformers is derived from RoBERTa, so it is important to reserve the first two position embeddings: every input sequence's position ids are shifted by that offset.
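
To make the shift concrete, this is roughly how RoBERTa-style models derive position ids from input ids; it is a simplified re-implementation of the offset logic, not the library helper itself:

import torch

def roberta_style_position_ids(input_ids, padding_idx=1):
    # Non-padding tokens are numbered padding_idx + 1, padding_idx + 2, ...,
    # so the first real token gets position 2 and slots 0/1 stay reserved.
    mask = input_ids.ne(padding_idx).int()
    positions = torch.cumsum(mask, dim=1) * mask
    return positions.long() + padding_idx

ids = torch.tensor([[0, 31414, 232, 2, 1, 1]])  # <s> Hello world </s> <pad> <pad>
print(roberta_style_position_ids(ids))          # tensor([[2, 3, 4, 5, 1, 1]])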

For the BigBird and Nyströmformer implementations in transformers, the input sequence is not shifted, so their position embeddings should be filled starting right from index 0. But one should keep in mind that a RoBERTa-to-BigBird or RoBERTa-to-Nyströmformer conversion should still only copy the original RoBERTa position_embeddings from index 2 onwards.
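
A sketch of that non-shifted case; the target here is just a plain nn.Embedding standing in for whatever position embedding table the target model actually uses, so treat the size and attribute paths as assumptions:

import torch
from transformers import RobertaModel

roberta = RobertaModel.from_pretrained("roberta-base")

# hypothetical non-shifted target: position ids start at 0, e.g. 4096 positions
target_pos_emb = torch.nn.Embedding(4096, roberta.config.hidden_size)

with torch.no_grad():
    src = roberta.embeddings.position_embeddings.weight[2:]  # skip the two reserved RoBERTa slots
    dst = target_pos_emb.weight
    for k in range(0, dst.shape[0], src.shape[0]):
        n = min(src.shape[0], dst.shape[0] - k)
        dst[k:k + n] = src[:n]  # tile [2:514] starting at index 0, without any offset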

Thanks for sharing your code. I'm keeping my changes in my own fork and would happily share them.