The tokenizer's bug in Huggingface

Question

Horikitasaku opened this issue 6 months ago

The tokenizer has a bug: it seems to be missing the [CLS] token.

The code in Hugging Face is:

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        sep = [self.sep_token_id]
        # cls = [self.cls_token_id]
        result = token_ids_0 + sep
        if token_ids_1 is not None:
            result += token_ids_1 + sep
        return result

But I think it should be:

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        sep = [self.sep_token_id]
        cls = [self.cls_token_id]
        result = cls + token_ids_0 + sep
        if token_ids_1 is not None:
            result += token_ids_1 + sep
        return result

according to the code on GitHub.
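To make the difference concrete, here is a minimal standalone sketch of the two versions. The function names and the ids 101 and 102 are hypothetical, picked only for illustration; the real tokenizer reads its ids from self.cls_token_id and self.sep_token_id.

```python
from typing import List, Optional

# Hypothetical special-token ids, chosen only for illustration.
CLS_ID = 101
SEP_ID = 102


def build_without_cls(
    token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """Mirrors the current code: the cls line is commented out."""
    result = token_ids_0 + [SEP_ID]
    if token_ids_1 is not None:
        result += token_ids_1 + [SEP_ID]
    return result


def build_with_cls(
    token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """Mirrors the proposed fix: [CLS] is prepended to the sequence."""
    result = [CLS_ID] + token_ids_0 + [SEP_ID]
    if token_ids_1 is not None:
        result += token_ids_1 + [SEP_ID]
    return result


if __name__ == "__main__":
    print(build_without_cls([7, 8], [9]))  # [7, 8, 102, 9, 102]
    print(build_with_cls([7, 8], [9]))     # [101, 7, 8, 102, 9, 102]
```

The only difference is the leading id: the current version starts directly with the first content token, while the proposed version starts with [CLS].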

Horikita_Saku · Answer 1 · Tue Feb 13 2024 20:07:32 GMT+0800 (China Standard Time)

I'm not sure why the author commented out the cls line, but I think it's a bug.
Please let me know if I'm wrong.

My reasoning is that since the model can be fine-tuned on downstream classification tasks, we should keep the [CLS] token.
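A rough sketch of why this matters, assuming a BERT-style classifier that reads its sentence summary from position 0 (this model may pool differently, so treat it as an assumption). The helper name and the id 101 are hypothetical.

```python
from typing import List

# Hypothetical [CLS] id, chosen only for illustration.
CLS_ID = 101


def starts_with_cls(input_ids: List[int], cls_id: int = CLS_ID) -> bool:
    """Check whether position 0 holds the [CLS] id.

    Assumption: the classification head pools the hidden state at
    position 0, as BERT-style models usually do.
    """
    return len(input_ids) > 0 and input_ids[0] == cls_id


if __name__ == "__main__":
    # Without [CLS], position 0 is an ordinary content token.
    print(starts_with_cls([7, 8, 102]))
    # With [CLS], position 0 is the dedicated summary token.
    print(starts_with_cls([101, 7, 8, 102]))
```

If the check fails on the tokenizer's output, a head that pools position 0 would be reading an ordinary word token instead of [CLS].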