kermitt2 / delft

a Deep Learning Framework for Text

Hardcoded padding tokens

lfoppiano opened this issue

I've noticed that the label used for padding is hard-coded, and for certain tokenizers it may be different (e.g. <s>).

For example (preprocess.py:343 and a few other places below):

label_ids.append("<PAD>")

perhaps we should change it to:

label_ids.append(self.tokenizer.pad_token)

This was found when investigating #150; I'm asking because I'm not 100% sure this is actually a problem.
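For context, the padding token string does differ across tokenizers. A minimal check, assuming the Hugging Face transformers library is available and the pretrained tokenizers can be downloaded:

from transformers import AutoTokenizer

# The string used as the padding token is tokenizer-specific:
print(AutoTokenizer.from_pretrained("bert-base-cased").pad_token)  # "[PAD]"
print(AutoTokenizer.from_pretrained("roberta-base").pad_token)     # "<pad>"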

Hi @lfoppiano !

This is just an empty label id: it corresponds to a subtoken ("empty" labels are assigned to the special tokens and subtokens introduced by the tokenizer). The labels are not related to the tokenizer, so we should not use self.tokenizer.pad_token.

See also #150 (comment) for more details.
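To illustrate the distinction: the "<PAD>" string here lives in label space, as a placeholder assigned to subtokens, not in the tokenizer's vocabulary. A minimal sketch of this kind of alignment (hypothetical helper names, not the actual DeLFT code):

def align_labels(words, labels, tokenize):
    # Align one label per word with the subtokens produced by the tokenizer.
    sub_tokens, label_ids = [], []
    for word, label in zip(words, labels):
        pieces = tokenize(word)             # e.g. "padding" -> ["pad", "##ding"]
        sub_tokens.extend(pieces)
        label_ids.append(label)             # first piece keeps the word's label
        label_ids.extend(["<PAD>"] * (len(pieces) - 1))  # continuation pieces get the placeholder
    return sub_tokens, label_ids

# toy word-piece-style tokenizer standing in for the real one
def toy_tokenize(word):
    return [word[:3], "##" + word[3:]] if len(word) > 3 else [word]

print(align_labels(["padding", "is", "fun"], ["B-term", "O", "O"], toy_tokenize))
# (['pad', '##ding', 'is', 'fun'], ['B-term', '<PAD>', 'O', 'O'])

Typically such placeholder labels are masked out during training and evaluation, which is why their exact string does not need to match the tokenizer's pad token.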

Thanks for the clarification!
I'll close this; we can comment further in #150 if needed.