Xirider / finetune-gpt2xl

Guide: Finetune GPT2-XL (1.5 Billion Parameters) and finetune GPT-NEO (2.7 B) on a single GPU with Huggingface Transformers using DeepSpeed


Training on a larger dataset fails due to memory issues on faster GPUs

jonnyplatt opened this issue

Thanks so much for producing this repo, it's been really helpful in getting up and running on the biggest GPT-Neo model.

I'm having an issue training gpt-neo_2-7B though - my dataset is just over 200 MB, which leads to an out-of-memory error at the very last step of loading the model into memory before training.

[INFO|integrations.py:533] 2021-04-20 12:40:32,650 >> Attempting to resume from paragraphs/checkpoint-600
[2021-04-20 12:40:32,664] [INFO] [engine.py:1445:_load_checkpoint] rank: 0 loading checkpoint: paragraphs/checkpoint-600/global_step600/mp_rank_00_model_states.pt
Traceback (most recent call last):
  File "run_clm.py", line 478, in <module>
    main()
  File "run_clm.py", line 441, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  [...]
RuntimeError: [enforce fail at CPUAllocator.cpp:65] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 10605230080 bytes. Error code 12 (Cannot allocate memory)

I've tried a number of GPUs on Google Cloud, and I can get it to run on a P100 since I can raise the RAM to 100 GB, but both the V100 and A100 fail (with 78 GB and 85 GB respectively).

Unfortunately Google puts a hard limit on RAM for these GPUs, and increasing the number of GPUs also doubles the number of processes launched and therefore the RAM required - so unless I pay for two GPUs and let one sit idle, I have to train on the much slower P100.

This is .. ok .. 😅 but I'd love to go faster if I can. So far I've tried:

  • Reducing per_device_train_batch_size to 2 (roughly the command sketched below)
  • Halving the dataset size

but neither has made a difference.
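For reference, this is roughly the kind of launch command that flag belongs to - a minimal sketch assuming the repo's run_clm.py and ds_config_gptneo.json; the model name, data file and output directory below are illustrative placeholders, not my exact invocation:

# Hedged sketch only: per_device_train_batch_size is a standard Hugging Face
# Trainer argument passed through run_clm.py; file names here are placeholders.
deepspeed --num_gpus=1 run_clm.py \
   --deepspeed ds_config_gptneo.json \
   --model_name_or_path EleutherAI/gpt-neo-2.7B \
   --train_file train.csv \
   --do_train \
   --per_device_train_batch_size 2 \
   --output_dir paragraphs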

Do you have any other tips on how I might squeeze into the 85 GB you get with an A100? It's so tantalizingly close - I wish Google would just let me add more RAM!

Hi!
You could try reducing allgather_bucket_size and reduce_bucket_size from 5e7 to 2e7 in ds_config_gptneo.json. You could also try reducing train_batch_size further to 1 and shrinking your dataset further.
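For reference, those two values sit in the ZeRO section of ds_config_gptneo.json; a minimal sketch of the change (other keys in the file are omitted here, and the exact surrounding layout may differ slightly from the repo's config):

{
  "zero_optimization": {
    "allgather_bucket_size": 2e7,
    "reduce_bucket_size": 2e7
  }
}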

Also, while there is a limit of 12 CPUs per V100 and 8 GB of RAM per CPU, Google also offers extended memory, which can go above that (up to 624 GB for the N1 instance type). For this you need to add these flags when creating your instance:
--custom-extensions --custom-cpu 12 --custom-memory 100

Let me know if these things work :)

Thanks, that really helped put me on the right track. In the end I had to slash the size of the dataset - it's now down to 45 MB and running with a per-GPU batch size of 1.
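For anyone following along, trimming the file itself was nothing fancy - a rough sketch with a hypothetical file name, and in practice you'd want to cut on document boundaries rather than raw bytes:

# Illustrative only: keep roughly the first 45 MB of a plain-text training file.
head -c 45M train.txt > train_small.txt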

I tried the allgather / reduce bucket sizes of 2e7 but encountered this error:
RuntimeError: start (0) + length (128657920) exceeds dimension size (90000000).
so I restored it to 5e7, and combined with the smaller dataset it's now running.

Best of all, where I was getting 35s/it on a P100, it's now storming along at 5s/it on the A100. So about 7x the speed for only 2x the price - very nice!

Google also offers extended memory, which can go above that (up to 624 GB for the N1 instance type). For this you need to add these flags when creating your instance:
--custom-extensions --custom-cpu 12 --custom-memory 100

I wonder if this will work directly from the CLI? I didn't try as I already had checkpoints saved on my instance, but I did attempt to edit the instance I'd created - the web UI was definitely limiting me to 78 GB per V100, and the A100s aren't available as custom machine types, so they are also limited to 85 GB.

Awesome :)

@jonnyplatt

I wonder if this will work directly from the CLI? I didn't try as I already had checkpoints saved on my instance, but I did attempt to edit the instance I'd created - the web UI was definitely limiting me to 78 GB per V100, and the A100s aren't available as custom machine types, so they are also limited to 85 GB.

Yes, you can only do it from the CLI. I am not sure whether there is an "extended memory" option for the A100, as it uses the A2 machine type, but for the V100 (which runs on the N1 machine type) it should definitely work with up to 624 GB of RAM. A V100 with about 150 GB of RAM should let you use a batch size of around 4 with your full dataset. Here is the full CLI command for it:

gcloud compute instances create gpuserver \
   --project YOURPROJECTID \
   --zone us-west1-b \
   --custom-cpu 12 \
   --custom-memory 150 \
   --maintenance-policy TERMINATE \
   --image-family pytorch-1-7-cu110 \
   --image-project deeplearning-platform-release \
   --boot-disk-size 200GB \
   --metadata "install-nvidia-driver=True" \
   --accelerator="type=nvidia-tesla-v100,count=1" \
   --custom-extensions
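
Once the instance is up, you can quickly check that the extended RAM and the GPU actually show up - a small sketch that assumes the instance name and zone from the command above:

# Sketch: print available memory and the attached GPU on the new instance.
gcloud compute ssh gpuserver --zone us-west1-b --command "free -h && nvidia-smi"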