prajdabre / yanmtt

Yet Another Neural Machine Translation Toolkit

Some confusion about <CUDA out of memory>

raullese opened this issue · comments

Hi, when I use train_mbart_model.sh to continue pretraining mBART-50, after 300k batches the following error occurred: RuntimeError: CUDA out of memory. Tried to allocate 1.90 GiB (GPU 0; 39.44 GiB total capacity; 19.93 GiB already allocated; 1.31 GiB free; 36.14 GiB reserved in total by PyTorch). At that point one epoch had already completed. I then restarted the run to continue pretraining, reducing the batch size from 2048 to 1024 and starting from the checkpoint that had just been saved. I'm confused about why this problem suddenly appeared after the task had run for so long; is there some hidden problem in the program?
I don't know why this happens, so I'm worried it will occur again after the task restarts, and I don't know whether changing only the batch size is enough.

I use 4 A100 GPUs (NVIDIA-SMI 515.48.07, Driver Version 515.48.07, CUDA Version 11.7), each with 40 GB of memory.

During this time there were most likely no other tasks competing for resources.

My script settings are:
export CUDA_VISIBLE_DEVICES=0,1,2,3 # Change to the GPU ID corresponding to a GPU that is free.
export PYTHONWARNINGS='ignore:semaphore_tracker:UserWarning'
nohup python pretrain_nmt_new.py -n 1 -nr 0 -g 4 --model_path gen_model/mbart_v1/mbart-50-v1 --tokenizer_name_or_path pretrain_model/mbart-50 --langs en_XX,es_XX,vi_VN,id_ID,th_TH,pt_XX,zh_CN,ko_KR --mono_src hyb_train_v1/train.en,hyb_train_v1/train.es,hyb_train_v1/train.vi,hyb_train_v1/train.id,hyb_train_v1/train.th,hyb_train_v1/train.pt,hyb_train_v1/train.zh,hyb_train_v1/train.ko --encoder_layers 12 --decoder_layers 12 --encoder_attention_heads=12 --decoder_attention_heads=12 --encoder_ffn_dim=128 --decoder_ffn_dim=4096 --d_model=1024 --batch_size 2048 --use_official_pretrained --pretrained_model pretrain_model/mbart-50 --no_reload_optimizer_ctr_and_scheduler --long_save_every 50000 --num_batches 10000000 --save_intermediate_checkpoints --data_sampling_temperature 1.0 --hard_truncate_length 512 --max_length 512 --shard_files > gen_model/mbart_v1/run_train.log 2>&1 &

Hi,

Can you send me your logs?

Last time I tested mbart-50 training, 1024 was the maximum batch size I could train with on my 32 GB GPU. Even then I would run out of memory on and off. So 40 GB should be OK with batches of size 2048, but I can't be 100% sure. It's 99% not a problem with my code. As to why this happens sporadically: 2048 might be at the edge of your GPUs' maximum capacity, and sometimes when the PyTorch allocator tries to allocate memory while at the edge of capacity, OOMs happen. That being said, a crash at 300k seems to be a one-off thing that happens with fairseq too.
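If you want to watch whether usage creeps toward the limit before the OOM hits, a minimal sketch like the one below can be dropped into any PyTorch training loop. This is generic code, not from yanmtt; the helper name log_gpu_memory is just an illustrative choice.

import torch

def log_gpu_memory(step):
    # Allocated = memory held by live tensors; reserved = what the caching allocator keeps from the driver.
    allocated_gb = torch.cuda.memory_allocated() / 1024**3
    reserved_gb = torch.cuda.memory_reserved() / 1024**3
    print(f"step {step}: allocated {allocated_gb:.2f} GB, reserved {reserved_gb:.2f} GB")

# Example usage inside the training loop:
# if step % 100 == 0:
#     log_gpu_memory(step)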

Also, you seem to have modified the pretraining script: pretrain_nmt_new.py
I can't be sure whether that is related to the problem or not.

I also have questions about your command:

  1. Why is --encoder_ffn_dim=128? Shouldn't it be 4096?
  2. pretrain_model/mbart-50 --> This is likely your own folder containing the tokenizer and last checkpoint. Actually I don't recommend this, because it means you will have lost your previous optimizer states. It makes me realize that my code was not really designed to resume a crashed fine-tuning run of an mbart-50 model.

If I were you I would do the following:

python pretrain_nmt_new.py -n 1 -nr 0 -g 4 --model_path gen_model/mbart_v1/mbart-50-v1 --tokenizer_name_or_path pretrain_model/mbart-50 --langs en_XX,es_XX,vi_VN,id_ID,th_TH,pt_XX,zh_CN,ko_KR --mono_src hyb_train_v1/train.en,hyb_train_v1/train.es,hyb_train_v1/train.vi,hyb_train_v1/train.id,hyb_train_v1/train.th,hyb_train_v1/train.pt,hyb_train_v1/train.zh,hyb_train_v1/train.ko --encoder_layers 12 --decoder_layers 12 --encoder_attention_heads=12 --decoder_attention_heads=12 --encoder_ffn_dim=4096 --decoder_ffn_dim=4096 --d_model=1024 --batch_size 1024 --pretrained_model <path to the last trained checkpoint without the pure_model suffix> --long_save_every 50000 --num_batches 10000000 --save_intermediate_checkpoints --data_sampling_temperature 1.0 --hard_truncate_length 512 --max_length 512 --shard_files

Then I would go to the following code block:

else: if "albert" in args.tokenizer_name_or_path: tok = AlbertTokenizer.from_pretrained(args.tokenizer_name_or_path, do_lower_case=False, use_fast=False, keep_accents=True) elif "mbart" in args.tokenizer_name_or_path: tok = MBartTokenizer.from_pretrained(args.tokenizer_name_or_path, do_lower_case=False, use_fast=False, keep_accents=True)

I would change the elif part to use AutoTokenizer, or create an if/else under the elif part to use the MBart50 tokenizer when "50" is present in the tokenizer name, as in the sketch below.
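For concreteness, a minimal sketch of that second option, assuming MBart50Tokenizer is available in your installed transformers version; the kwargs simply mirror the original block, and the surrounding else: branch is omitted here.

from transformers import AlbertTokenizer, MBartTokenizer, MBart50Tokenizer

if "albert" in args.tokenizer_name_or_path:
    tok = AlbertTokenizer.from_pretrained(args.tokenizer_name_or_path, do_lower_case=False, use_fast=False, keep_accents=True)
elif "mbart" in args.tokenizer_name_or_path:
    if "50" in args.tokenizer_name_or_path:
        # mbart-50 checkpoints expect the MBart50 tokenizer rather than the original MBart one.
        tok = MBart50Tokenizer.from_pretrained(args.tokenizer_name_or_path, do_lower_case=False, use_fast=False, keep_accents=True)
    else:
        tok = MBartTokenizer.from_pretrained(args.tokenizer_name_or_path, do_lower_case=False, use_fast=False, keep_accents=True)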

This is a temporary fix, I know. I will consider fixing the whole flow to make it easier to resume crashed fine-tuning runs of official models.

A small edit:

In your previous run there should be a folder with the suffix "_deploy". Set --tokenizer_name_or_path to this folder and you can resume training.

I am also considering splitting the --use_official_pretrained flag into two, one for the tokenizer and one for the model, so that you can use the official tokenizer together with a non-official, locally fine-tuned model. That would be perfect for your situation, which requires resuming a crashed fine-tuning run of an official model, and it would eliminate the need to modify the code.

Thanks for your patience @prajdabre

The following is part of my run_train.log:

302000 2.69 83.57 seconds for 100 batches. Memory used post forward / backward passes: 18.2 / 16.49 GB.
302100 2.56 48.22 seconds for 100 batches. Memory used post forward / backward passes: 18.27 / 16.37 GB.
302200 2.61 47.66 seconds for 100 batches. Memory used post forward / backward passes: 18.21 / 16.47 GB.
302300 2.65 47.51 seconds for 100 batches. Memory used post forward / backward passes: 18.18 / 16.49 GB.
302400 2.7 47.25 seconds for 100 batches. Memory used post forward / backward passes: 18.23 / 16.45 GB.
302500 2.55 47.33 seconds for 100 batches. Memory used post forward / backward passes: 18.08 / 16.47 GB.
302600 2.5 47.62 seconds for 100 batches. Memory used post forward / backward passes: 18.13 / 16.5 GB.
302700 2.65 47.4 seconds for 100 batches. Memory used post forward / backward passes: 18.08 / 16.4 GB.
302800 2.6 48.05 seconds for 100 batches. Memory used post forward / backward passes: 18.36 / 16.49 GB.
302900 2.57 47.87 seconds for 100 batches. Memory used post forward / backward passes: 17.99 / 16.42 GB.
Saving the model
Loading from checkpoint
load ckp2 model success
Loading from checkpoint
load ckp2 model success
Loading from checkpoint
load ckp2 model success
Loading from checkpoint
load ckp2 model success
303000 2.53 82.88 seconds for 100 batches. Memory used post forward / backward passes: 18.02 / 16.43 GB.
303100 2.64 48.8 seconds for 100 batches. Memory used post forward / backward passes: 18.2 / 16.46 GB.
/****/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:247: UserWarning: To get the last learning rate computed by the scheduler, please use `get_last_lr()`.
  warnings.warn("To get the last learning rate computed by the scheduler, "
/****/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
/****/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:216: UserWarning: Please also save or load the state of the optimizer when saving or loading the scheduler.
  warnings.warn(SAVE_STATE_WARNING, UserWarning)
Traceback (most recent call last):
  File "pretrain_nmt_new.py", line 970, in <module>
    run_demo()
  File "pretrain_nmt_new.py", line 967, in run_demo
    mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
  File "/****/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/****/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/****/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/****/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/****/mt_mbart/yanmtt/pretrain_nmt_new.py", line 637, in model_create_load_run_save
    loss.backward()
  File "/****/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/****/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 1.90 GiB (GPU 0; 39.44 GiB total capacity; 19.93 GiB already allocated; 1.31 GiB free; 36.14 GiB reserved in total by PyTorch)

I don't know if this is the log you wanted to see.
Also, when the task starts running, a ****/mbart-50_deploy/config.json is generated as follows:

{ "_name_or_path": "pretrain_model/mbart-50", "_num_labels": 3, "activation_dropout": 0.1, "activation_function": "gelu", "adaptor_dropout": 0.1, "adaptor_hidden_size": 512, "adaptor_init_std": 0.02, "adaptor_scaling_factor": 1.0, "adaptor_tuning": false, "add_bias_logits": false, "add_final_layer_norm": true, "additional_source_wait_k": -1, "architectures": [ "MBartForConditionalGeneration" ], "attention_dropout": 0.1, "bos_token_id": 0, "bottleneck_mid_fusion_tokens": 4, "classif_dropout": 0.0, "classifier_dropout": 0.0, "d_model": 1024, "decoder_adaptor_tying_config": null, "decoder_attention_heads": 16, "decoder_ffn_dim": 4096, "decoder_layerdrop": 0.0, "decoder_layers": 12, "decoder_start_token_id": 2, "decoder_tying_config": null, "deep_adaptor_tuning": false, "deep_adaptor_tuning_ffn_only": false, "dropout": 0.1, "early_stopping": true, "embed_low_rank_dim": 0, "encoder_adaptor_tying_config": null, "encoder_attention_heads": 16, "encoder_ffn_dim": 4096, "encoder_layerdrop": 0.0, "encoder_layers": 12, "encoder_tying_config": null, "eos_token_id": 2, "expert_ffn_size": 128, "features_embed_dims": null, "features_vocab_sizes": null, "forced_eos_token_id": 2, "gradient_checkpointing": false, "gradient_reversal_for_domain_classifier": false, "hypercomplex": false, "hypercomplex_n": 2, "ia3_adaptors": false, "id2label": { "0": "LABEL_0", "1": "LABEL_1", "2": "LABEL_2" }, "init_std": 0.02, "is_encoder_decoder": true, "label2id": { "LABEL_0": 0, "LABEL_1": 1, "LABEL_2": 2 }, "layernorm_adaptor_input": false, "layernorm_prompt_projection": false, "max_length": 200, "max_position_embeddings": 1024, "mid_fusion_layers": 3, "model_type": "mbart", "moe_adaptors": false, "multi_source": false, "multi_source_method": null, "multilayer_softmaxing": null, "no_embed_norm": false, "no_positional_encoding_decoder": false, "no_positional_encoding_encoder": false, "no_projection_prompt": false, "no_scale_attention_embedding": false, "normalize_before": true, "normalize_embedding": true, "num_beams": 5, "num_domains_for_domain_classifier": -1, "num_experts": 8, "num_hidden_layers": 12, "num_moe_adaptor_experts": 4, "num_prompts": 100, "output_past": true, "pad_token_id": 1, "parallel_adaptors": false, "positional_encodings": false, "prompt_dropout": 0.1, "prompt_init_std": 0.02, "prompt_projection_hidden_size": 4096, "prompt_tuning": false, "recurrent_projections": 1, "residual_connection_adaptor": false, "residual_connection_prompt": false, "scale_embedding": true, "softmax_bias_tuning": false, "softmax_temperature": 1.0, "static_position_embeddings": false, "temperature_calibration": false, "tokenizer_class": "MBart50Tokenizer", "transformers_version": "4.3.2", "unidirectional_encoder": false, "use_cache": true, "use_moe": false, "use_tanh_activation_prompt": false, "vocab_size": 250054, "wait_k": -1 }

To answer your question 1 (why --encoder_ffn_dim=128): last time I just set a random value for --encoder_ffn_dim as a test, and I found that encoder_ffn_dim in the generated ****/mbart-50_deploy/config.json was not affected by it; it was still 4096. So I just left it alone, but I see it misled you. I don't think this part is the problem.

As for your question 2: this morning, when I found that my task had crashed, I backed up all the generated models, logs, and the _deploy folder;
[screenshot: backed-up model files and logs]

I then copied /****/mbart-50-v1_deploy/pytorch_model.bin into the pretrain_model/mbart-50 directory to replace the open-source model downloaded from Hugging Face. The reason I did this is that I want to resume the crashed task and continue pretraining from my previous model. After reading your reply above, I think there may be a problem with this approach, and I have to think about how to adjust it correctly. As for the models generated before the task crashed, there is also a large checkpoint of about 6.9 GB consisting of {'model': model.state_dict(), 'optimizer': optimizer.state_dict(), 'scheduler': scheduler.state_dict(), 'ctr': ctr}. I am considering whether I need to load this large checkpoint, but I also need to confirm its creation time and state, i.e. whether it is the latest state from before the crash.
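If it helps, one way to confirm which state the large checkpoint holds is to load it on the CPU and print its counter; a minimal sketch, assuming the file really contains the {'model', 'optimizer', 'scheduler', 'ctr'} dict mentioned above (the path is just an example):

import torch

# Load on the CPU so inspecting the checkpoint does not touch GPU memory.
ckpt = torch.load("resume_model/mbart-50-v1.300000", map_location="cpu")

print(ckpt.keys())                     # expected: dict_keys(['model', 'optimizer', 'scheduler', 'ctr'])
print("saved at batch:", ckpt["ctr"])  # the iteration counter stored at save time

del ckpt  # free the host memory once inspected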

As for the script pretrain_nmt_new.py: in issue #35 I mentioned that an error occurred when I reloaded the big model, and after our discussion I edited some code in the saving and loading part of pretrain_nmt.py. The modification is actually very simple;
it is as follows:

For this part, I commented out some of the code that reloads the checkpoint right after saving the model:
CHECKPOINT_PATH = args.model_path
if rank == 0:
    checkpoint_dict = {'model': model.state_dict(), 'optimizer': optimizer.state_dict(), 'scheduler': scheduler.state_dict(), 'ctr': 0}
    torch.save(checkpoint_dict, CHECKPOINT_PATH) ## Save a model by default every eval_every steps. This model will be saved with the same file name each time.
    torch.save(model.module.state_dict(), CHECKPOINT_PATH+".pure_model")
dist.barrier()
# map_location = {'cuda:%d' % 0: 'cuda:%d' % gpu}
# checkpoint_dict = torch.load(CHECKPOINT_PATH+".pure_model", map_location=map_location)
# model.load_state_dict(checkpoint_dict)
# optimizer.load_state_dict(checkpoint_dict['optimizer'])
# scheduler.load_state_dict(checkpoint_dict['scheduler'])
# del checkpoint_dict
# torch.cuda.empty_cache()

And for this part, I did something similar to achieve the same goal:

## Copy the long saved model deploy folder.
os.system("cp "+CHECKPOINT_PATH+"."+str(ctr)+".pure_model "+CHECKPOINT_PATH+"_deploy/pytorch_model.bin")
start = time.time()
# Use a barrier() to make sure that process 1 loads the model after process
# 0 saves it.
dist.barrier()
# configure map_location properly
print("Loading from checkpoint")
# map_location = {'cuda:%d' % 0: 'cuda:%d' % gpu}
sys.stdout.flush()
# checkpoint_dict = torch.load(CHECKPOINT_PATH+".pure_model", map_location=map_location)
# model.load_state_dict(checkpoint_dict)
# optimizer.load_state_dict(checkpoint_dict['optimizer'])
# scheduler.load_state_dict(checkpoint_dict['scheduler'])
# del checkpoint_dict
torch.cuda.empty_cache()

There are no other changes in this script.

I just found out that I can upload pictures here, so I will add screenshots of the changes in the new script pretrain_nmt_new.py:

[screenshot: first change in pretrain_nmt_new.py]

[screenshot: second change in pretrain_nmt_new.py]

As for resuming the crashed task, is it more reasonable to load this mbart-50-v1.300000 and retrain from it?
I haven't tried it yet, and I don't know how to modify the model-loading code. @prajdabre

[screenshot: generated checkpoint files]

Hi,

I just pushed some code a moment ago with some explanation in the commit which I am pasting here.

Changes:

  1. Split the use_official_pretrained flag into use_official_pretrained and use_official_pretrained_tokenizer to give more control. If you want to use an official tokenizer but not the official model, then just pass --use_official_pretrained_tokenizer along with the official model's name. If you want to use an official model, then the official tokenizer will automatically be used by default via --use_official_pretrained.
  2. A new flag called --locally_fine_tuned_model has been added. This flag makes sure that you can now resume a fine-tuning run of a local model. Suppose you were fine-tuning mbart50 but the training suddenly crashed or finished; if you now want to resume training on some other data, you simply need to use this flag and point it to the relevant model checkpoint that was saved locally. This allows two possibilities: you resume from the previous optimizer state, or you reset the optimizer. For the former you need to pass the model without the "pure_model" suffix. For the latter, you can use the model without the "pure_model" suffix together with the flag --no_reload_optimizer_ctr_and_scheduler, OR you can just use the model with the "pure_model" suffix (see the sketch after this list).
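To make point 2 concrete, here is a rough, illustrative sketch of the resume logic these options imply. This is not the code from the commit; the function name and the CPU map_location are just illustrative choices.

import torch

def load_local_checkpoint(path, model, optimizer, scheduler, reload_optimizer=True):
    """Illustrative resume logic: full checkpoints carry training state, .pure_model files carry only weights."""
    checkpoint = torch.load(path, map_location="cpu")
    ctr = 0
    if isinstance(checkpoint, dict) and "model" in checkpoint:
        # Full checkpoint (saved without the "pure_model" suffix).
        model.load_state_dict(checkpoint["model"])
        if reload_optimizer:  # False when --no_reload_optimizer_ctr_and_scheduler is passed
            optimizer.load_state_dict(checkpoint["optimizer"])
            scheduler.load_state_dict(checkpoint["scheduler"])
            ctr = checkpoint["ctr"]
    else:
        # A ".pure_model" file holds only the weights, so the optimizer starts fresh.
        model.load_state_dict(checkpoint)
    del checkpoint
    return ctr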

With these changes, you can keep your original fine-tuning command (the one which ran up to 300k iterations) and just make the following changes: --locally_fine_tuned_model --batch_size 1024 --multistep_optimizer_steps 2

--multistep_optimizer_steps 2 will simulate a 2048 batch size.
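In case the mechanics are unclear: --multistep_optimizer_steps behaves like standard gradient accumulation, where gradients from several smaller batches are summed before a single optimizer step. A generic, self-contained toy sketch of the idea (not yanmtt's actual training loop):

import torch
import torch.nn.functional as F

# Toy setup just to illustrate the accumulation pattern.
model = torch.nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
data_loader = [(torch.randn(8, 16), torch.randn(8, 4)) for _ in range(4)]

accumulation_steps = 2  # two small batches per optimizer step, approximating one batch twice as large

optimizer.zero_grad()
for step, (x, y) in enumerate(data_loader):
    loss = F.mse_loss(model(x), y) / accumulation_steps  # scale so the summed gradient matches the big-batch average
    loss.backward()  # gradients accumulate in .grad across the small batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()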

Please try and let me know if it works. If yes then you should see the model resuming from iteration 300000.

Thank you for the reply, @prajdabre.

I am using your latest versions of pretrain_nmt.py and common_utils.py, and I have one point of confusion.
If I understand correctly,

[screenshot: model reloading code]

this part corresponds to reloading the latest local model state from mbart-50-v1.300000. I set a marker at the red box, and it does print "reload model successful!" in the log.

And the following part corresponds to reloading the optimizer, right?

[screenshot: optimizer reloading code]

But the marker I set at the red box shows that the optimizer may not be getting reloaded. What's the reason for that?

The following is my script's command:
nohup python pretrain_nmt.py -n 1 -nr 0 -g 4 --model_path gen_model/mbart_v1/mbart-50-v1 --tokenizer_name_or_path pretrain_model/mbart-50 --langs en_XX,es_XX,vi_VN,id_ID,th_TH,pt_XX,zh_CN,ko_KR --mono_src hyb_train_v1/train.en,hyb_train_v1/train.es,hyb_train_v1/train.vi,hyb_train_v1/train.id,hyb_train_v1/train.th,hyb_train_v1/train.pt,hyb_train_v1/train.zh,hyb_train_v1/train.ko --encoder_layers 12 --decoder_layers 12 --encoder_attention_heads=12 --decoder_attention_heads=12 --encoder_ffn_dim=4096 --decoder_ffn_dim=4096 --d_model=1024 --batch_size 1024 --use_official_pretrained --pretrained_model pretrain_model/mbart-50 --no_reload_optimizer_ctr_and_scheduler --long_save_every 50000 --num_batches 10000000 --save_intermediate_checkpoints --data_sampling_temperature 1.0 --hard_truncate_length 512 --max_length 512 --locally_fine_tuned_model_path resume_model/mbart-50-v1.300000 --multistep_optimizer_steps 2 --shard_files > gen_model/mbart_v1/run_train.log 2>&1 &

And I'd like to confirm with you that in your new version of pretrain_nmt.py I changed only two parts.
One change is as follows:

[screenshot: first change]

The other change is:

[screenshot: second change]

I made these two changes in order to resolve the problem of #35.
The rest of the content is the same as your latest version.

Oh, I see. I found that in my command I accidentally set the parameter --no_reload_optimizer_ctr_and_scheduler; I should remove it to ensure the existing optimizer state can be reloaded successfully, right?

Exactly! You got it! Hope it runs smoothly now.

@prajdabre It works when I remove --no_reload_optimizer_ctr_and_scheduler.
The program seems to be running smoothly at present. I will check with you about the next steps when other issues appear. Thank you very much.
When you have free time, could you help me confirm the changes I listed above in pretrain_nmt.py, whether there are any problems with the command, and whether any settings need to be added? Thanks.


Yeah, thanks.

I think I figured out why your training run crashed.

The part you commented out contains del checkpoint_dict

Actually this dict maintains a copy of the model parameters, and this takes up GPU space.

Originally this would be deleted, but since you commented it out it remains in memory, leading to the OOM error.

Note that this issue will exist for your previous run, which trained up to 300k iterations. But with the current run the issue won't exist, since the checkpoint_dict is deleted.

Hope it makes sense.
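To make this concrete, the general pattern the original code follows looks roughly like the simplified sketch below (not the exact yanmtt code): the reloaded dict is a second full copy of the weights, so it has to be dropped explicitly once load_state_dict has consumed it.

import torch

def save_and_reload(model, optimizer, scheduler, ctr, path, map_location):
    # Save the full training state...
    checkpoint_dict = {'model': model.state_dict(), 'optimizer': optimizer.state_dict(),
                       'scheduler': scheduler.state_dict(), 'ctr': ctr}
    torch.save(checkpoint_dict, path)

    # ...then reload it (in the real code every rank reloads after a barrier).
    checkpoint_dict = torch.load(path, map_location=map_location)
    model.load_state_dict(checkpoint_dict['model'])
    optimizer.load_state_dict(checkpoint_dict['optimizer'])
    scheduler.load_state_dict(checkpoint_dict['scheduler'])

    # Without these two lines the loaded dict keeps an extra copy of all tensors alive
    # (on the GPU, when map_location points there), which eats away the memory headroom.
    del checkpoint_dict
    torch.cuda.empty_cache()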

In fact, I also suspected this was the reason, but I was not sure, and this is not the first time the OOM problem has occurred. I will keep observing it for a while, and I hope the problem is now solved. The current GPU memory usage is stable at 30 GB out of 40 GB, and because of --multistep_optimizer_steps 2 the time consumed per 100 batches is less than doubled.