Xirider / finetune-gpt2xl

Guide: Finetune GPT2-XL (1.5 Billion Parameters) and GPT-NEO (2.7B) on a single GPU with Huggingface Transformers using DeepSpeed

Using [text,labels] instead of just [text] in Datasets

imthebilliejoe opened this issue · comments

Hi, I'd like to start with a big thanks for your amazing work. I would like to use your library to fine-tune GPT-NEO on a Text2Text task instead of text generation. I'm trying to adapt your script run_clm.py to handle not just a dataset with [text] but one with a [text, label] structure.

So I'm now trying to create a train_dataset from these two new tokenized datasets, built this way:

def tokenize_function_text(examples):
    return tokenizer(examples["text"])

tokenized_datasets_text = datasets.map(
    tokenize_function_text,
    batched=True,
    num_proc=data_args.preprocessing_num_workers,
    remove_columns=column_names,
    load_from_cache_file=not data_args.overwrite_cache,
)

def tokenize_function_label(examples):
    return tokenizer(examples["label"])

tokenized_datasets_label = datasets.map(
    tokenize_function_label,
    batched=True,
    num_proc=data_args.preprocessing_num_workers,
    remove_columns=column_names,
    load_from_cache_file=not data_args.overwrite_cache,
)

But I'm now really struggling to merge them into a single "train_dataset" object that I can give to the Trainer. Do you have any tips or suggestions?

Thank you very much!
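One possible way to build that single train_dataset for a causal LM is sketched below. This is only an illustration, not the approach the author ended up using: it assumes the tokenizer, datasets, column_names, and data_args objects already defined in run_clm.py, raw columns named "text" and "label", and an EOS token used as separator.

# Sketch only (not the author's final solution). Assumes tokenizer, datasets,
# column_names and data_args exist as in run_clm.py.
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 / GPT-Neo tokenizers have no pad token

def tokenize_function_pair(examples):
    # Concatenate input text and target label into one sequence,
    # separated and terminated by the EOS token (separator choice is an assumption).
    merged = [
        text + tokenizer.eos_token + label + tokenizer.eos_token
        for text, label in zip(examples["text"], examples["label"])
    ]
    out = tokenizer(merged, truncation=True, padding="max_length", max_length=512)
    # Causal-LM models in transformers shift labels internally, so labels can be
    # a copy of input_ids; padding positions are set to -100 so the loss ignores them.
    out["labels"] = [
        [tok if mask == 1 else -100 for tok, mask in zip(ids, attn)]
        for ids, attn in zip(out["input_ids"], out["attention_mask"])
    ]
    return out

tokenized_pairs = datasets.map(
    tokenize_function_pair,
    batched=True,
    num_proc=data_args.preprocessing_num_workers,
    remove_columns=column_names,
    load_from_cache_file=not data_args.overwrite_cache,
)
train_dataset = tokenized_pairs["train"]

With a dataset in this shape the Trainer can be used as in run_clm.py, since each example already carries input_ids, attention_mask, and labels of equal length.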

In case someone is trying to do the same thing: I solved the issue by adapting the code from this article:

http://mohitmayank.com/a_lazy_data_science_guide/natural_language_processing/GPTs/#finetuning-gpt-2-for-sentiment-classification
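For context, that guide frames the task as sequence classification rather than generation. A minimal sketch of that general idea with transformers (not necessarily the exact code from the article; the model name, num_labels, max_length, and column names here are assumptions) looks roughly like this:

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "gpt2"  # assumption; a GPT-Neo checkpoint can be substituted here
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-style tokenizers have no pad token by default

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id  # needed for batched classification

def tokenize(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)

# raw_datasets is assumed to be a datasets.DatasetDict with a "text" column and an
# integer "label" column; the Trainer picks up "label" as the classification target.
tokenized = raw_datasets.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt-cls", per_device_train_batch_size=8, num_train_epochs=1),
    train_dataset=tokenized["train"],
)
trainer.train()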