Hardcoded padding tokens
lfoppiano opened this issue · comments
I've noticed that the padding label is hard-coded, but in certain tokenizers it may differ (e.g. `<s>`).
For example (preprocess.py:343 and few other places below):
label_ids.append("<PAD>")
perhaps we should change it to:
label_ids.append(self.tokenizer.pad_token)
This was found while investigating #150; I'm asking because I'm not 100% sure it's actually a problem.
Hi @lfoppiano !
This is just an empty label id: it marks a subtoken ("empty" labels are assigned to special tokens and to subtokens introduced by the tokenizer). The labels themselves are unrelated to the tokenizer's vocabulary, so we should not use self.tokenizer.pad_token.
See also #150 (comment) for more details.
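To illustrate the distinction, here is a minimal sketch (with hypothetical names, not the actual preprocess.py code) of word-to-subtoken label alignment: only the first subtoken of a word keeps the real label, and the remaining subtokens receive a placeholder label like "<PAD>" that is later masked out of the loss. Nothing here depends on the tokenizer's own pad token.

```python
# Placeholder label for subtokens/special tokens; unrelated to tokenizer.pad_token.
PAD_LABEL = "<PAD>"

def align_labels(words, labels, subtokenize):
    """Align word-level labels to subtokens.

    subtokenize: callable mapping a word to its list of subtokens.
    Returns (subtoken list, label list of the same length).
    """
    token_ids, label_ids = [], []
    for word, label in zip(words, labels):
        subtokens = subtokenize(word)
        token_ids.extend(subtokens)
        # Real label on the first subtoken, placeholder on the rest.
        label_ids.extend([label] + [PAD_LABEL] * (len(subtokens) - 1))
    return token_ids, label_ids

# Toy subtokenizer for illustration only.
toy = {"running": ["run", "##ning"], "fast": ["fast"]}
tokens, labs = align_labels(["running", "fast"], ["B-ACT", "O"],
                            lambda w: toy[w])
# tokens → ["run", "##ning", "fast"]
# labs   → ["B-ACT", "<PAD>", "O"]
```

Swapping `PAD_LABEL` for `self.tokenizer.pad_token` would tie the label scheme to the tokenizer's vocabulary for no benefit, since these placeholder labels never pass through the tokenizer.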
Thanks for the clarification!
I'll close this; we can continue the discussion in #150 if needed.