CUNY-CL / yoyodyne

Small-vocabulary sequence-to-sequence generation with optional feature conditioning

W&B sweeps giving OOM

Othergreengrasses opened this issue

I'm trying to run a W&B sweep for the transformer. It was able to finish 5 runs successfully but failed on the rest of the 150 runs. Here is the error that I got after those 5 runs:

Run x7n6tsz8 errored:
Traceback (most recent call last):
File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/wandb/agents/pyagent.py", line 308, in _run_job
self._function()
File "/home/aru-sarthak/yoyodyne_041123/yoyodyne/examples/wandb_sweeps/./train_wandb_sweep.py", line 43, in train_sweep
best_checkpoint = train.train(trainer, model, datamodule, args.train_from)
File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/yoyodyne/train.py", line 253, in train
trainer.fit(model, datamodule, ckpt_path=train_from)
File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
call._call_and_handle_interrupt(
File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1093, in _run
self.strategy.setup(self)
File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/pytorch_lightning/strategies/single_device.py", line 73, in setup
self.model_to_device()
File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/pytorch_lightning/strategies/single_device.py", line 70, in model_to_device
self.model.to(self.root_device)
File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/lightning_fabric/utilities/device_dtype_mixin.py", line 54, in to
return super().to(*args, **kwargs)
File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1152, in to
return self._apply(convert)
File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
[Previous line repeated 2 more times]
File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 825, in _apply
param_applied = fn(param)
File "/home/aru-sarthak/anaconda3/envs/python_10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1150, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 15.69 GiB of which 4.38 MiB is free. Process 2063 has 211.51 MiB memory in use. Including non-PyTorch memory, this process has 15.23 GiB memory in use. Of the allocated memory 14.92 GiB is allocated by PyTorch, and 60.01 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Gist to reproduce the error -
https://gist.github.com/Othergreengrasses/4b950d336a112dd799b4120bcbeb60e7

Happy to take a look. Can you provide the hyperparameter YAML file and the command you used? (I didn't see that in the Gist, maybe I missed it though.)

@kylebgorman
GitHub doesn't support attaching YAML files, so I'm pasting the contents of the file here (btw, I got the YAML file from you):

method: bayes
metric:
  name: val_accuracy
  goal: maximize
parameters:
  # Constants.
  arch:
    value: transformer
  max_epochs:
    value: 200
  patience:
    value: 40
  reduceonplateau_mode:
    value: accuracy
  gradient_clip_val:
    value: 3
  # Hyperparameters.
  attention_heads:
    values: [4, 6, 8]
  encoder_layers:
    values: [4, 6, 8]
  decoder_layers:
    values: [4, 6, 8]
  embedding_size:
    distribution: q_uniform
    q: 16
    min: 16
    max: 512
  hidden_size:
    distribution: q_uniform
    q: 64
    min: 64
    max: 1024
  dropout:
    distribution: uniform
    min: 0
    max: 0.5
  label_smoothing:
    distribution: uniform
    min: 0.0
    max: 0.2
  batch_size:
    distribution: q_uniform
    q: 128
    min: 128
    max: 2048
  learning_rate:
    distribution: log_uniform_values
    min: 0.00001
    max: 0.01
  scheduler:
    values: [null, reduceonplateau, warmupinvsqrt]
  reduceonplateau_factor:
    distribution: uniform
    min: 0.1
    max: 0.9
  reduceonplateau_patience:
    distribution: q_uniform
    q: 1
    min: 1
    max: 5
  min_lr:
    distribution: log_uniform_values
    min: 0.000001
    max: 0.001
  warmup_samples:
    distribution: q_uniform
    q: 100
    min: 100
    max: 5000000

Commands that I used:
wandb sweep --entity ENTITY --project Google_ben transformer_broader_config.yaml
./train_wandb_sweep.py --entity ENTITY --project Google_ben --sweep_id SWEEPID --model_dir models --experiment Google_transformer --train g2p-Google-train.tsv --val g2p-Google-dev.tsv --arch transformer --patience 10 --max_time 00:06:00:00 --count 200 --accelerator gpu --seed 1818 --source_sep ' ' --target_sep ' '
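
As an aside, for anyone replicating this: the sweep config above is consumed by a W&B agent that samples hyperparameters for each run. Below is a minimal, hypothetical sketch of that loop; the real examples/wandb_sweeps/train_wandb_sweep.py in yoyodyne also parses arguments and builds the trainer, model, and datamodule, and the SWEEP_ID placeholder and body of train_sweep here are assumptions, not the script's actual code.

# Hypothetical sketch of a W&B agent consuming the sweep config above.
# The real train_wandb_sweep.py does more; this only shows the sampling loop.
import wandb

SWEEP_ID = "ENTITY/Google_ben/SWEEPID"  # placeholders, as in the commands above


def train_sweep() -> None:
    # Each trial starts a run; wandb.config holds the sampled hyperparameters
    # (arch, batch_size, embedding_size, hidden_size, ...).
    wandb.init()
    config = dict(wandb.config)
    print(f"Sampled hyperparameters: {config}")
    # ...build the yoyodyne model/trainer from config and call trainer.fit() here...


if __name__ == "__main__":
    # Draws up to 200 trials from the Bayesian sweep controller.
    wandb.agent(SWEEP_ID, function=train_sweep, count=200)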

@Adamits this is a pretty straightforward grid for a transformer, I think, so it's a surprise to me that this would OOM on the vast majority of runs. I believe @Othergreengrasses is on a 4th-generation Nvidia card. (That said, I don't have one at home, so I can only try to replicate with a 1st-generation card.)

Yes, Kyle, you are right. I am on a 4th-generation Nvidia card.

Also, the error says that I have 8.19 MB of free space, so I don't understand why it is throwing OOM.

I'm running the experiment in a fresh environment (Python 3.10), installing yoyodyne from source.

You could try the suggestion to set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True (prepend it to whatever command you're running) if you haven't. I haven't tried this yet.
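
(If it's easier than editing the command line, my understanding is that the same allocator setting can also be set from Python before CUDA is first touched; just a sketch, not something yoyodyne exposes:)

# Sketch: set the allocator config programmatically instead of prepending it
# on the command line. PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching
# allocator initializes, so it must be set before the first CUDA allocation.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported after the env var is set, before anything touches CUDA

print(torch.cuda.is_available())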

There's a panel in W&B for "GPU Memory Allocated (%)" (under "System"). If there were a memory leak across the runs of a sweep, you'd expect this to creep up as the number of runs increased. I just checked an old sweep (predating the supposed fix we put in place for this) and I don't see that pattern at all. Not sure what to make of that.
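
For what it's worth, one way to check for a leak locally, independent of the W&B panel, is to log CUDA memory counters between trials. A rough sketch; the cleanup hook and where you'd call it are assumptions, not something yoyodyne provides:

# Diagnostic sketch: log CUDA memory between sweep trials to see whether
# allocated memory creeps up run over run, which would suggest a leak.
import gc

import torch


def log_cuda_memory(tag: str) -> None:
    allocated = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    peak = torch.cuda.max_memory_allocated() / 2**20
    print(f"[{tag}] allocated={allocated:.1f} MiB "
          f"reserved={reserved:.1f} MiB peak={peak:.1f} MiB")


def end_of_trial_cleanup() -> None:
    # Drop dangling references, release unoccupied cached blocks, and reset
    # the peak counter so the next trial is measured on its own.
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()


# Hypothetical usage at the end of each sweep trial:
# log_cuda_memory("before cleanup")
# end_of_trial_cleanup()
# log_cuda_memory("after cleanup")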

Setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True didn't work either.

After replicating some of these things on our lab machines (with GTX 1080s, which have 8 GB of VRAM), I think the transformer models we are working with are simply too large for this combination of batch sizes, number of layers, and hidden layer dimensionalities. I am not seeing evidence of a leak. Automagical batch sizing (à la #148, with the added stipulation that, having found the maximum batch size, it then picks the right mini-batch size and the right number of mini-batches per batch) ought to handle this for good.
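
Until something like #148 lands, one manual way to approximate that behavior is gradient accumulation: treat the sampled batch_size as an effective batch size, cap the per-step mini-batch at whatever the GPU tolerates, and accumulate gradients across mini-batches. A sketch using PyTorch Lightning's accumulate_grad_batches; the cap of 256 below is an assumption, not a measured limit for these models:

# Sketch: emulate a large sampled batch size with smaller mini-batches via
# gradient accumulation. MAX_MICRO_BATCH is hypothetical; the right value
# depends on the model size and the GPU.
import math

import pytorch_lightning as pl

MAX_MICRO_BATCH = 256  # assumption; tune to what actually fits on the card


def accumulation_settings(effective_batch_size: int) -> tuple[int, int]:
    """Splits a sampled batch size into (micro_batch_size, accumulation_steps)."""
    steps = max(1, math.ceil(effective_batch_size / MAX_MICRO_BATCH))
    micro = math.ceil(effective_batch_size / steps)
    return micro, steps


# E.g. a sampled batch_size of 2048 becomes 8 mini-batches of 256:
micro, steps = accumulation_settings(2048)
trainer = pl.Trainer(accelerator="gpu", accumulate_grad_batches=steps)
# ...build the datamodule with batch_size=micro and call trainer.fit(...) as usual...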

I'm going to close this for now. We can return to it later. #148 is the path forward...