Xirider / finetune-gpt2xl

Guide: Finetune GPT2-XL (1.5 Billion Parameters) and GPT-NEO (2.7B) on a single GPU with Huggingface Transformers using DeepSpeed

Can't change BOS token or EOS token for GPT Neo

mallorbc opened this issue · comments

commented

In order to better control the start and stop of generated text, I have added BOS and EOS tokens for GPT2-XL. This works well: the generated text stops at an appropriate length and starts the way a normal sentence would. However, I want to do the same for GPT Neo, and there it does not work. I have discovered that, for some reason, the arguments that normally set BOS and EOS are not applied when GPT Neo is run, even if I change the tokenizer from AutoTokenizer to GPT2Tokenizer. Below is some code that shows what I mean.

    tokenizer = GPT2Tokenizer.from_pretrained(
        model_args.model_name_or_path,
        bos_token='<|beginingtext|>',
        eos_token='<|endingtext|>',
        pad_token='<|pad|>',
        **tokenizer_kwargs)
    print(tokenizer.eos_token)
    print(tokenizer.bos_token)
    quit()

As I said, when I run this with GPT2-XL, the tokens are changed as expected. When I run this with GPT Neo, both the BOS and EOS tokens come back as <|endoftext|>.
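For reference, a minimal standalone check (a sketch, assuming the public gpt2-xl and EleutherAI/gpt-neo-2.7B checkpoints and an installed transformers package) that loads both tokenizers with the same arguments and prints what each one reports:

    from transformers import GPT2Tokenizer

    # Load both tokenizers with the same custom special tokens and print what
    # each one actually reports for BOS/EOS.
    for name in ["gpt2-xl", "EleutherAI/gpt-neo-2.7B"]:
        tok = GPT2Tokenizer.from_pretrained(
            name,
            bos_token='<|beginingtext|>',
            eos_token='<|endingtext|>',
            pad_token='<|pad|>')
        print(name, "BOS:", tok.bos_token, "EOS:", tok.eos_token)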

commented

After looking into this further, this may be a bug outside of this project. I am going to open an issue on the Hugging Face repo. I could be wrong, though.

Not 100% sure about this, but according to https://github.com/finetuneanon/gpt-neo_finetune_2.7B#dataset-preparation there is no BOS token in GPT Neo.
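If GPT Neo indeed ships without a BOS token, one common alternative (not confirmed in this thread, just a standard transformers pattern) is to register the special tokens explicitly after loading the tokenizer and then resize the model's embedding matrix so the newly added token ids have rows:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")
    # add_special_tokens returns how many tokens were actually added to the vocab
    num_added = tokenizer.add_special_tokens({
        "bos_token": "<|beginingtext|>",
        "eos_token": "<|endingtext|>",
        "pad_token": "<|pad|>",
    })

    model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B")
    if num_added > 0:
        # Give the new token ids embedding rows before finetuning
        model.resize_token_embeddings(len(tokenizer))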

commented

Thanks. Maybe it's not a bug then. Without a BOS token and an EOS token I can still accomplish my goals; it just takes a different, less elegant method.
Thanks!
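One workaround along those lines (the thread does not say which method was actually used, so this is only a sketch) is to keep GPT Neo's existing <|endoftext|> token as the sole delimiter: append it to every training example and stop generation at that token id.

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")

    examples = ["First training sample.", "Second training sample."]
    # Append the existing EOS token to each example instead of adding new BOS/EOS tokens.
    train_text = "".join(example + tokenizer.eos_token for example in examples)

    # At generation time, pass eos_token_id so output stops at <|endoftext|>:
    #   model.generate(..., eos_token_id=tokenizer.eos_token_id)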