CUDA out of memory

Question

Aditya-iitdh opened this issue 11 days ago

Dear authors,

Thanks for this amazing work. I am interested in reproducing the results of your paper, but I am getting a torch.cuda.OutOfMemoryError.

Currently I am trying to run it on Google Colab with a T4 GPU.
Do you think adding a second GPU would solve this problem? If not, what else can I try?
Below is the complete output message:

/bin/bash: line 2: fg: no job control
mkdir: cannot create directory ‘logs/TEMPO/loar_revin_100_percent_1_prompt_equal_1/’: File exists
mkdir: cannot create directory ‘logs/TEMPO/loar_revin_100_percent_1_prompt_equal_1/ettm2_pmt1_no_pool_TEMPO_6’: File exists
logs/TEMPO/loar_revin_100_percent_1_prompt_equal_1/ettm2_pmt1_no_pool_TEMPO_6/test_336_96_lr0.001.log
Namespace(model_id='etth1_TEMPO_6_prompt_learn_336_96_100', checkpoints='./lora_revin_6domain_checkpoints_1/', task_name='long_term_forecast', prompt=1, num_nodes=1, seq_len=336, pred_len=96, label_len=168, decay_fac=0.5, learning_rate=0.001, batch_size=256, num_workers=0, train_epochs=10, lradj='type3', patience=5, gpt_layers=6, is_gpt=1, e_layers=3, d_model=768, n_heads=4, d_ff=768, dropout=0.3, enc_in=7, c_out=1, patch_size=16, kernel_size=25, loss_func='mse', pretrain=1, freeze=1, model='TEMPO', stride=8, max_len=-1, hid_dim=16, tmax=20, itr=1, cos=1, equal=1, pool=False, no_stl_loss=False, stl_weight=0.001, config_path='./configs/multiple_datasets.yml', datasets='ETTm1,ETTh2,ETTm2,electricity,traffic,weather', target_data='ETTh1', use_token=0, electri_multiplier=1, traffic_multiplier=1, embed='timeF')
['ETTm1', 'ETTh2', 'ETTm2', 'electricity', 'traffic', 'weather']
ETTm1
dataset: ett_m
train 238903
val 79975
ETTh2
dataset: ett_h
self.enc_in = 7
self.data_x = (8640, 7)
train 57463
self.enc_in = 7
self.data_x = (3216, 7)
val 19495
ETTm2
dataset: ett_m
train 238903
val 79975
electricity
dataset: custom
train 5771901
val 814377
traffic
dataset: custom
train 10213838
val 1431782
weather
dataset: custom
train 765576
val 108675
ETTm1
dataset: ett_m
train 238903
ETTh2
dataset: ett_h
self.enc_in = 7
self.data_x = (8640, 7)
train 57463
ETTm2
dataset: ett_m
train 238903
electricity
dataset: custom
train 5771901
traffic
dataset: custom
train 10213838
weather
dataset: custom
train 765576
Way1 1251978
self.enc_in = 7
self.data_x = (3216, 7)
test 19495
trainable params: 308736 || all params: 82207488 || trainable%: 0.38
0% 0/4891 [00:01<?, ?it/s]
Traceback (most recent call last):
File "/content/TEMPO/main_multi_6domain_release.py", line 292, in
outputs, loss_local = model(batch_x, ii, seq_trend, seq_seasonal, seq_resid) #+ model(seq_seasonal, ii) + model(seq_resid, ii)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/content/TEMPO/models/TEMPO.py", line 446, in forward
x = self.gpt2_trend(inputs_embeds =x_all).last_hidden_state
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/peft/peft_model.py", line 642, in forward
return self.get_base_model()(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/gpt2/modeling_gpt2.py", line 1116, in forward
outputs = block(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/gpt2/modeling_gpt2.py", line 651, in forward
feed_forward_hidden_states = self.mlp(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/gpt2/modeling_gpt2.py", line 572, in forward
hidden_states = self.act(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/activations.py", line 56, in forward
return 0.5 * input * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (input + 0.044715 * torch.pow(input, 3.0))))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 460.00 MiB. GPU

Defu Cao · Answer 1 · Tue Jul 16 2024 08:45:55 GMT+0800 (China Standard Time)

Hi Aditya,

Thanks for your interest in our work! For the OOM error, could you try reducing the batch size and see if that helps?

Best

Aditya-iitdh · Answer 2 · Tue Jul 16 2024 13:29:28 GMT+0800 (China Standard Time)

Thanks, it worked!
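
For readers who hit the same error: the traceback fails inside the GPT-2 MLP activation, where the intermediate tensor grows linearly with the batch size, which is why lowering batch_size (256 in the Namespace line above) frees memory. The snippet below is only a rough sketch of that scaling, not the TEMPO authors' code. It assumes float32 activations and an MLP inner width of 4 * d_model (the GPT-2 default, with d_model=768 taken from the log). The helper name activation_mib and the token count TOKENS_PER_SAMPLE are hypothetical and chosen purely for illustration.

```python
def activation_mib(batch_size: int, tokens_per_sample: int,
                   hidden_width: int, bytes_per_element: int = 4) -> float:
    """Size in MiB of one dense tensor of shape
    (batch_size, tokens_per_sample, hidden_width)."""
    total_bytes = batch_size * tokens_per_sample * hidden_width * bytes_per_element
    return total_bytes / 2**20


# GPT-2's MLP inner width defaults to 4 * d_model; d_model=768 is from the log.
HIDDEN_WIDTH = 4 * 768

# Hypothetical number of tokens per sample, NOT read from the TEMPO code.
TOKENS_PER_SAMPLE = 128

# Halving the batch size halves the size of each intermediate activation.
for batch_size in (256, 128, 64, 32):
    size = activation_mib(batch_size, TOKENS_PER_SAMPLE, HIDDEN_WIDTH)
    print(f"batch_size={batch_size:>3}: {size:.1f} MiB per MLP activation tensor")
```

This counts a single intermediate tensor only. A real training step keeps many such tensors alive for the backward pass, so the total saving from a smaller batch is correspondingly larger than one line of this output suggests.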