microsoft / ContextualSP

Open-source code for multiple papers from the Microsoft Research Asia DKI group

Training on CANARD data did not save model.tar.gz for me

mriganktiwari opened this issue

I tried training a model with the CANARD data, but the checkpoints directory does not contain any .tar.gz file.
Could someone kindly help?

commented

@mriganktiwari Thanks for your interest in our work! Did you manually shut down the training? If the training finishes as expected, a model.tar.gz file should be saved along with the checkpoints.
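
If the run was interrupted after at least one epoch with validation, you can usually package the best checkpoint into model.tar.gz yourself. Below is a minimal sketch assuming the AllenNLP 0.9-style API visible in your traceback, and assuming training got far enough to write config.json and best.th into the serialization directory (the directory path here is hypothetical):

    # Manually archive an interrupted AllenNLP run into model.tar.gz.
    # Assumes config.json and best.th already exist in the serialization
    # directory; both are written by `allennlp train` before/during training.
    from allennlp.models.archival import archive_model

    serialization_dir = "checkpoints/canard"  # hypothetical: use your -s path

    # Bundles config.json + best.th into <serialization_dir>/model.tar.gz.
    archive_model(serialization_dir, weights="best.th")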

Yes, it got killed midway with this error:

ROUGE: 0.0000, EM: 0.0000, F1: 0.0000, F2: 0.0000, F3: 0.0000, BLEU4: 0.0000, loss: 0.0882 ||:  58%|#####8    | 4573/7882 [21:50<11:54,  4.63it/s]Traceback (most recent call last):
  File "/home/mrigank/miniconda3/envs/uttre/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/home/mrigank/miniconda3/envs/uttre/lib/python3.8/site-packages/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "/home/mrigank/miniconda3/envs/uttre/lib/python3.8/site-packages/allennlp/commands/__init__.py", line 102, in main
    args.func(args)
  File "/home/mrigank/miniconda3/envs/uttre/lib/python3.8/site-packages/allennlp/commands/train.py", line 117, in train_model_from_args
    train_model_from_file(args.param_path,
  File "/home/mrigank/miniconda3/envs/uttre/lib/python3.8/site-packages/allennlp/commands/train.py", line 163, in train_model_from_file
    return train_model(params,
  File "/home/mrigank/miniconda3/envs/uttre/lib/python3.8/site-packages/allennlp/commands/train.py", line 252, in train_model
    metrics = trainer.train()
  File "/home/mrigank/miniconda3/envs/uttre/lib/python3.8/site-packages/allennlp/training/trainer.py", line 478, in train
    train_metrics = self._train_epoch(epoch)
  File "/home/mrigank/miniconda3/envs/uttre/lib/python3.8/site-packages/allennlp/training/trainer.py", line 320, in _train_epoch
    loss = self.batch_loss(batch_group, for_training=True)
  File "/home/mrigank/miniconda3/envs/uttre/lib/python3.8/site-packages/allennlp/training/trainer.py", line 261, in batch_loss
    output_dict = self.model(**batch)
  File "/home/mrigank/miniconda3/envs/uttre/lib/python3.8/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mrigank/code/ContextualSP/incomplete_utterance_rewriting/src/./model.py", line 250, in forward
    attn_map = self.segmentation_net(attn_input)
  File "/home/mrigank/miniconda3/envs/uttre/lib/python3.8/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mrigank/code/ContextualSP/incomplete_utterance_rewriting/src/./attn_unet.py", line 45, in forward
    x = self.up2(x, x1)
  File "/home/mrigank/miniconda3/envs/uttre/lib/python3.8/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mrigank/code/ContextualSP/incomplete_utterance_rewriting/src/./attn_unet.py", line 111, in forward
    x = torch.cat([x2, x1], dim=1)
RuntimeError: CUDA out of memory. Tried to allocate 706.00 MiB (GPU 0; 7.93 GiB total capacity; 5.97 GiB already allocated; 672.81 MiB free; 6.53 GiB reserved in total by PyTorch)
ROUGE: 0.0000, EM: 0.0000, F1: 0.0000, F2: 0.0000, F3: 0.0000, BLEU4: 0.0000, loss: 0.0882 ||:  58%|#####8    | 4573/7882 [21:52<15:49,  3.48it/s]

Regarding the CUDA error: if my GPU memory were insufficient, shouldn't training have failed to start in the first place?
Which hyperparameter can I reduce to complete training with comparable performance on the test set? My GPU is a GTX 1080 with 8 GB of RAM.

commented

@mriganktiwari Yeah, it is correct that the training will not finish if the memory is insufficient; however, an OOM error can strike mid-training because GPU usage varies with batch content (a batch of unusually long sequences can spike the allocation), which is why the job started fine and crashed at 58%. You may try reducing the batch_size hyperparameter so the code can run with less than 8 GB of GPU memory. BTW, if you are not using BERT, such a high memory cost is strange.
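
If you launch training through AllenNLP's Python entry point, the batch size can be overridden without editing the config file. A minimal sketch, assuming the AllenNLP 0.9 train_model_from_file API from your traceback and an iterator-style config; the config path and the "iterator.batch_size" key are assumptions about your setup, not values taken from the repo:

    # Retrain with a smaller batch size via a config override.
    from allennlp.commands.train import train_model_from_file

    train_model_from_file(
        "configs/canard.jsonnet",   # hypothetical config path
        "checkpoints/canard_bs8",   # fresh serialization directory
        overrides='{"iterator": {"batch_size": 8}}',  # reduce until it fits in 8 GB
    )

The same override can also be passed on the command line via the -o/--overrides flag of allennlp train.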

commented

Closing this since there has been no further activity. Feel free to re-open.