Hardcoded padding tokens
lfoppiano opened this issue · comments
I've noticed that the padding label is hard-coded, but in certain tokenizers it may differ (e.g. `<s>`).
For example (preprocess.py:343 and few other places below):
label_ids.append("<PAD>")
perhaps we should change it to:
label_ids.append(self.tokenizer.pad_token)
This was found while investigating #150; I'm asking because I'm not 100% sure it's actually a problem.
Hi @lfoppiano !
This is just an empty label id: it marks a subtoken ("empty" labels are assigned to special tokens and to subtokens introduced by the tokenizer). The labels themselves are unrelated to the tokenizer's vocabulary, so we should not use self.tokenizer.pad_token.
See also #150 (comment) for more details.
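To illustrate the distinction, here is a minimal sketch (with hypothetical names, not the actual preprocess.py code) of word-to-subtoken label alignment: only the first subtoken of a word keeps the real label, and the remaining subtokens receive a placeholder label like "<PAD>" that is later masked out of the loss. Nothing here depends on the tokenizer's own pad token.

```python
# Placeholder label for subtokens/special tokens; unrelated to tokenizer.pad_token.
PAD_LABEL = "<PAD>"

def align_labels(words, labels, subtokenize):
    """Align word-level labels to subtokens.

    subtokenize: callable mapping a word to its list of subtokens.
    Returns (subtoken list, label list of the same length).
    """
    token_ids, label_ids = [], []
    for word, label in zip(words, labels):
        subtokens = subtokenize(word)
        token_ids.extend(subtokens)
        # Real label on the first subtoken, placeholder on the rest.
        label_ids.extend([label] + [PAD_LABEL] * (len(subtokens) - 1))
    return token_ids, label_ids

# Toy subtokenizer for illustration only.
toy = {"running": ["run", "##ning"], "fast": ["fast"]}
tokens, labs = align_labels(["running", "fast"], ["B-ACT", "O"],
                            lambda w: toy[w])
# tokens → ["run", "##ning", "fast"]
# labs   → ["B-ACT", "<PAD>", "O"]
```

Swapping `PAD_LABEL` for `self.tokenizer.pad_token` would tie the label scheme to the tokenizer's vocabulary for no benefit, since these placeholder labels never pass through the tokenizer.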
Thanks for the clarification!
I'll close this; we can continue the discussion in #150 if needed.