Xirider / finetune-gpt2xl

Guide: Finetune GPT2-XL (1.5 Billion Parameters) and GPT-NEO (2.7B) on a single GPU with Huggingface Transformers using DeepSpeed

Can't change BOS token or EOS token for GPT Neo

mallorbc opened this issue · comments

commented

In order to better control the start and stop of generated text, I have added BOS and EOS tokens for GPT2-XL. This works well: the generated text stops at an appropriate length and starts the way a normal sentence would. However, I want to do the same for GPT Neo, and there it does not work. I have discovered that, for some reason, the arguments that normally set BOS and EOS are not applied when GPT Neo is run, even if I change the tokenizer from AutoTokenizer to GPT2Tokenizer. Below is some code that shows what I mean.

    tokenizer = GPT2Tokenizer.from_pretrained(
        model_args.model_name_or_path,
        bos_token='<|beginingtext|>',
        eos_token='<|endingtext|>',
        pad_token='<|pad|>',
        **tokenizer_kwargs)
    print(tokenizer.eos_token)
    print(tokenizer.bos_token)
    quit()

As I said, when I run this with GPT2-XL, the tokens are changed as expected. When I run this with GPT Neo, both the BOS and EOS tokens come back as <|endoftext|>.
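For reference, a minimal standalone check (a sketch, assuming the public gpt2-xl and EleutherAI/gpt-neo-2.7B checkpoints and an installed transformers package) that loads both tokenizers with the same arguments and prints what each one reports:

    from transformers import GPT2Tokenizer

    # Load both tokenizers with the same custom special tokens and print what
    # each one actually reports for BOS/EOS.
    for name in ["gpt2-xl", "EleutherAI/gpt-neo-2.7B"]:
        tok = GPT2Tokenizer.from_pretrained(
            name,
            bos_token='<|beginingtext|>',
            eos_token='<|endingtext|>',
            pad_token='<|pad|>')
        print(name, "BOS:", tok.bos_token, "EOS:", tok.eos_token)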

commented

After looking into this further, this may be a bug outside of this project. I am going to open an issue on the Hugging Face repo. I could be wrong, though.

Not 100% sure about this, but according to https://github.com/finetuneanon/gpt-neo_finetune_2.7B#dataset-preparation there is no BOS token in GPT Neo.
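If GPT Neo indeed ships without a BOS token, one common alternative (not confirmed in this thread, just a standard transformers pattern) is to register the special tokens explicitly after loading the tokenizer and then resize the model's embedding matrix so the newly added token ids have rows:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")
    # add_special_tokens returns how many tokens were actually added to the vocab
    num_added = tokenizer.add_special_tokens({
        "bos_token": "<|beginingtext|>",
        "eos_token": "<|endingtext|>",
        "pad_token": "<|pad|>",
    })

    model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B")
    if num_added > 0:
        # Give the new token ids embedding rows before finetuning
        model.resize_token_embeddings(len(tokenizer))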

commented

Thanks. Maybe it's not a bug then. Without a BOS token and an EOS token I can still accomplish my goals; it just takes a different, less elegant method.
Thanks!
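One workaround along those lines (the thread does not say which method was actually used, so this is only a sketch) is to keep GPT Neo's existing <|endoftext|> token as the sole delimiter: append it to every training example and stop generation at that token id.

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")

    examples = ["First training sample.", "Second training sample."]
    # Append the existing EOS token to each example instead of adding new BOS/EOS tokens.
    train_text = "".join(example + tokenizer.eos_token for example in examples)

    # At generation time, pass eos_token_id so output stops at <|endoftext|>:
    #   model.generate(..., eos_token_id=tokenizer.eos_token_id)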