martiansideofthemoon / style-transfer-paraphrase

Official code and data repository for our EMNLP 2020 long paper "Reformulating Unsupervised Style Transfer as Paraphrase Generation" (https://arxiv.org/abs/2010.05700).

Home Page: http://style.cs.umass.edu

Issue in running finetune paraphrase script

abhisha1991 opened this issue · comments

Hey Kalpesh and team,

Thanks very much for releasing your work - it is great to see a simple architecture like this being used for something novel. We're trying to get up and running with the base setup - we have downloaded all the data and corresponding models to the right folders. However, upon running the fine-tuning script, we get the attached error.

Our setup is a cloud VM with 1 GPU (NVIDIA Tesla T4), Ubuntu 18.04, 7.5 GB RAM, PyTorch 1.10, and CUDA 11.5.
We have confirmed that PyTorch and CUDA are installed and available on the machine (see attachments).

We'd be incredibly grateful if you could release a Docker image with pre-installed dependencies, or help us identify the exact failure mode we are hitting below. We're unable to proceed past this error. We're also unable to locate the error logs (~/style-transfer-paraphrase/style_paraphrase/logs), and are thus unable to understand what is wrong with our setup.

(screenshots attached: error traceback, CUDA check, PyTorch check)

Hi @abhisha1991,
Unfortunately I've not encountered this error before, so I'm not 100% sure the following will work, but it's still worth a try:

  1. Try downgrading PyTorch to 1.7. I've confirmed the code works on my cluster with PyTorch 1.7 / CUDA 10.1.
  2. Try removing the DDP dependencies from the command: remove -m torch.distributed.launch --nproc_per_node=1 from the bash script. That way only a single PyTorch process will run the code, and args.local_rank will automatically be set to -1 (see the sketch after this list). If this gives you any error, let me know.
  3. The CPU RAM seems quite low (7.5 GB), so I'm wondering if you are getting an OOM error in a child process.
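
For context on point 2: HuggingFace-style finetuning scripts (which this repo's training code follows) branch on --local_rank, and that flag is only filled in when torch.distributed.launch spawns the worker processes. Here is a minimal sketch of that pattern, not the repo's exact code, assuming the usual argparse setup:

```python
import argparse
import torch

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank to each worker; without the
# launcher the default of -1 is used, which selects the single-process path.
parser.add_argument("--local_rank", type=int, default=-1)
args = parser.parse_args()

if args.local_rank == -1:
    # Single process: use the one visible GPU, or fall back to CPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
else:
    # DDP path: bind this process to its GPU and join the process group.
    torch.cuda.set_device(args.local_rank)
    device = torch.device("cuda", args.local_rank)
    torch.distributed.init_process_group(backend="nccl")
```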

Solution 2 is probably much less work, so I suggest trying that first.

Hello @martiansideofthemoon
I hope you are doing great. I am facing another problem related to run_finetune_paraphrase.sh. I am trying to run it in Google Colab; it executes for a few seconds and then reports that the CUDA GPU is out of memory. I have also experimented with changing the batch size in the .sh file, but it didn't help. I would appreciate your help with that.
(screenshot attached: GPU out-of-memory error)

Hi @TufailAhmadSiddiq, what's the smallest batch size you tried? Reducing the batch size is fine, since you can use gradient accumulation to keep a larger effective batch size (see the sketch below).
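
For illustration, here is a minimal gradient-accumulation sketch in plain PyTorch, assuming `model`, `optimizer`, and `dataloader` are already set up and that the model returns a loss HuggingFace-style; in practice you would pass the corresponding accumulation argument to the finetuning script rather than write your own loop:

```python
accumulation_steps = 4  # effective batch size = per-step batch size * 4

optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    loss = model(**batch).loss
    # Scale the loss so accumulated gradients average rather than sum.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```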

Thanks for the reply. The minimum batch size I have tried is 2, but I am still facing the same problem.

Is this with GPT2-large? As long as batch size 1 fits, it should be fine. You can also switch GPT2-large to GPT2-medium; it doesn't hurt performance much. Another option is gradient checkpointing (see the sketch below).
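
To make that concrete, here is a rough sketch of the underlying transformers calls. In this repo the model name is set inside the finetuning bash script, so the actual switch is a one-line edit there; also note that gradient_checkpointing_enable() exists in recent transformers versions, while older ones set config.gradient_checkpointing = True instead:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Swap the base model: "gpt2-medium" instead of "gpt2-large".
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

# Trade compute for memory: recompute activations during the backward pass.
model.gradient_checkpointing_enable()
model.train()
```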

Can you please point out where I should make these changes?

Thanks for the guidance. I will try it and check whether it works.

Hello! Hope you are doing well. I am trying to fine-tune your model on my custom dataset. When I run !style_paraphrase/examples/run_finetune_paraphrase.sh, I get the following error:
(screenshot attached: error output)
I followed the first two steps of "Custom Datasets" in this repository. At the third step, while converting the BPE codes to fairseq binaries, a "Permission denied" error occurs.

@HassanBinAli I think you are missing the dataset files in the repo. Please download the train.pickle file from here and place it at datasets/paranmt_filtered/train.pickle.
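
For anyone hitting the same error, a quick sanity check before launching the script (paths follow this repo's expected layout):

```python
from pathlib import Path

# Expected location of the paraphrase training data in this repo's layout.
train_pickle = Path("datasets/paranmt_filtered/train.pickle")

if not train_pickle.exists():
    raise FileNotFoundError(
        f"{train_pickle} is missing; download train.pickle and place it here "
        "before running style_paraphrase/examples/run_finetune_paraphrase.sh"
    )
print(f"Found {train_pickle} ({train_pickle.stat().st_size / 1e6:.1f} MB)")
```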

Thank you. It resolved the error.