svip-lab / PlaneDepth

[CVPR2023] This is an official implementation for "PlaneDepth: Self-supervised Depth Estimation via Orthogonal Planes".

"train.py" crush when using flag `--use_mixture_loss`

BarRozenman opened this issue

I run train.py as follows:

```
CUDA_VISIBLE_DEVICES=0 torchrun  train.py \
--png \
--model_name exp1 \
--use_denseaspp \
--plane_residual \
--flip_right \
--use_mixture_loss
```

and I get

```
>> CUDA_VISIBLE_DEVICES=0 torchrun  train.py \
--png \
--model_name exp1 \
--use_denseaspp \
--plane_residual \
--flip_right \
--use_mixture_loss \

./trainer_1stage.py not exist!
copy ./networks/depth_decoder.py -> ./log/ResNet/exp1/depth_decoder.py
copy ./train_ResNet.sh -> ./log/ResNet/exp1/train_ResNet.sh
train ResNet
use 49 xy planes, 14 xz planes and 0 yz planes.
use DenseAspp Block
use mixture Lap loss
use plane residual
Training model named:
   exp1
Models and tensorboard events files are saved to:
   ./log/ResNet
Training is using:
   cuda
Using split:
   eigen_full_left
There are 22600 training items and 1776 validation items

Training
[W reducer.cpp:1303] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
Traceback (most recent call last):
  File "/home/bar/projects/PlaneDepth/train.py", line 21, in <module>
    trainer.train()
  File "/home/bar/projects/PlaneDepth/trainer.py", line 248, in train
    self.run_epoch()
  File "/home/bar/projects/PlaneDepth/trainer.py", line 300, in run_epoch
    losses["loss/total_loss"].backward()
  File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 27588) of binary: /home/bar/miniconda3/envs/planedepth/bin/python
Traceback (most recent call last):
  File "/home/bar/miniconda3/envs/planedepth/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.10.1', 'console_scripts', 'torchrun')())
  File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-24_18:06:29
  host      : clikaws105
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 27588)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```

Any other configuration of "train.py" I use without `--use_mixture_loss` runs smoothly. For example, the command below runs well:

```
CUDA_VISIBLE_DEVICES=0 torchrun  train.py \
--png \
--model_name exp1 \
--use_denseaspp \
--plane_residual \
--flip_right
```

Can anyone please help me fix this?

Hi, I tested the code on various types of GPUs but did not get the same error 😥. Since running with the mixture Laplace loss (MLL) is only a little different from the L1 loss, I think these are the two key snippets:

Loss Function

Change `if self.opt.use_mixture_loss` in line 728 of trainer.py to `if False` to rule out the loss calculation.

Warping Function

Change `if self.opt.use_mixture_loss` in line 594 of trainer.py to `if False` to rule out the sigma-dependent probability.
Change `if self.opt.use_mixture_loss` in line 570 of trainer.py to `if False` to rule out the sigma warping.

The flag in depth_decoder.py also causes a difference, but I don't think it's the main problem.
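
For context, here is a minimal sketch of how a Laplace-mixture negative log-likelihood differs from a plain L1 photometric term. This is a generic illustration only, not the actual code in trainer.py; the function names, tensor shapes, and the eps guard are all assumptions.

```python
# Generic sketch: plain L1 photometric loss vs. a Laplace-mixture NLL.
# Assumption: names, shapes and the eps guard are illustrative and do not
# mirror the exact implementation in trainer.py.
import torch


def l1_photometric(pred, target):
    # Plain L1 photometric error, averaged over all pixels.
    return (pred - target).abs().mean()


def mixture_laplace_nll(preds, target, pi, sigma, eps=1e-7):
    """Negative log-likelihood of a Laplace mixture over planes.

    preds:  (B, N, C, H, W) one reconstruction per plane
    target: (B, 1, C, H, W) the real image
    pi:     (B, N, 1, H, W) mixture weights, summing to 1 over N
    sigma:  (B, N, 1, H, W) Laplace scales, strictly positive
    """
    # Per-plane L1 error, averaged over the colour channels.
    abs_err = (preds - target).abs().mean(dim=2, keepdim=True)
    # Laplace density of that error under each mixture component.
    component = pi * torch.exp(-abs_err / sigma) / (2.0 * sigma)
    # Sum over planes, then take the negative log; eps guards log(0).
    nll = -torch.log(component.sum(dim=1) + eps)
    return nll.mean()
```

Compared with the L1 path, the mixture path keeps per-plane error, weight, and scale tensors alive for the backward pass, so it can use noticeably more GPU memory; that is consistent with the batch-size fix reported further down in this thread.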

I hope these may help you catch the troublemaker. Please let me know if you find anything new.

Thank you, but I was hoping that I could use the mixture loss and not just avoid it.

Yeah, you’re right. But maybe these tips can help you find the main issue, which might be just one or two lines of code, and then we can fix it so this error stops.

Could you add the list of the GPUs that you tested the repository on?

I solved it! All I had to do was reduce the `batch_size`, and now it runs smoothly. Maybe this issue could be added to a "known issues" section in the readme.md file.
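
For reference, a reduced-batch run would look something like the command below. The `--batch_size` flag name and the value 4 are assumptions for illustration; check the training options defined in this repository for the actual argument name.

```
CUDA_VISIBLE_DEVICES=0 torchrun train.py \
--png \
--model_name exp1 \
--use_denseaspp \
--plane_residual \
--flip_right \
--use_mixture_loss \
--batch_size 4
```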

Awesome! I'm so happy that the problem is solved! I will add it to readme.md following your suggestion, thank you!