The tokenizer's bug in Huggingface
Horikitasaku opened this issue · comments
The code in tokenizer
has a bug, it seem to be missing the [CLS]
the code in huggingface is
def build_inputs_with_special_tokens(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
sep = [self.sep_token_id]
# cls = [self.cls_token_id]
result = token_ids_0 + sep
if token_ids_1 is not None:
result += token_ids_1 + sep
return result
but I think it should be
def build_inputs_with_special_tokens(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
sep = [self.sep_token_id]
cls = [self.cls_token_id]
result = cls + token_ids_0 + sep
if token_ids_1 is not None:
result += token_ids_1 + sep
return result
according to the code from github
I'm not sure why the author commented the cls
But I think it's a bug
Please let me know if I'm wrong.
My idea is that since the model has the ability to fine-tune downstream classification tasks, we should keep the [CLS].