HazyResearch / hyena-dna

Official implementation for HyenaDNA, a long-range genomic foundation model built with Hyena

Home Page: https://arxiv.org/abs/2306.15794

Tokenizer bug in the Hugging Face version

Horikitasaku opened this issue

The tokenizer code has a bug: it seems to be missing the [CLS] token.

The code on Hugging Face is

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        sep = [self.sep_token_id]
        # cls = [self.cls_token_id]
        result = token_ids_0 + sep
        if token_ids_1 is not None:
            result += token_ids_1 + sep
        return result

but I think it should be

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        sep = [self.sep_token_id]
        cls = [self.cls_token_id]
        result = cls + token_ids_0 + sep
        if token_ids_1 is not None:
            result += token_ids_1 + sep
        return result

according to the code on GitHub.

I'm not sure why the author commented out the cls line, but I think it's a bug. Please let me know if I'm wrong.

My reasoning is that since the model can be fine-tuned on downstream classification tasks, we should keep the [CLS] token.
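
To make the difference concrete, here is a minimal standalone sketch of the two variants. It does not use the real tokenizer: the special-token ids (`CLS_ID`, `SEP_ID`), the function names, and the sample input ids are all made up for illustration, so the actual ids in the HyenaDNA vocabulary may differ.

```python
from typing import List, Optional

# Hypothetical special-token ids, chosen only for this example.
CLS_ID = 0
SEP_ID = 1


def build_without_cls(
    token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """Mirrors the Hugging Face version quoted above: no [CLS] is prepended."""
    result = token_ids_0 + [SEP_ID]
    if token_ids_1 is not None:
        result += token_ids_1 + [SEP_ID]
    return result


def build_with_cls(
    token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """Mirrors the proposed version: [CLS] is prepended to the sequence."""
    result = [CLS_ID] + token_ids_0 + [SEP_ID]
    if token_ids_1 is not None:
        result += token_ids_1 + [SEP_ID]
    return result


seq = [7, 8, 9]  # made-up ids standing in for a tokenized DNA sequence
print(build_without_cls(seq))  # [7, 8, 9, 1]
print(build_with_cls(seq))     # [0, 7, 8, 9, 1]
```

With the Hugging Face version, the sequence starts directly with the first sequence token, so there is no leading [CLS] position for a classification head to read from.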