"train.py" crush when using flag `--use_mixture_loss`
BarRozenman opened this issue
I run train.py as follows:
CUDA_VISIBLE_DEVICES=0 torchrun train.py \
--png \
--model_name exp1 \
--use_denseaspp \
--plane_residual \
--flip_right \
--use_mixture_loss
and I get:
>> CUDA_VISIBLE_DEVICES=0 torchrun train.py \ [main]
--png \
--model_name exp1 \
--use_denseaspp \
--plane_residual \
--flip_right \
--use_mixture_loss \
./trainer_1stage.py not exist!
copy ./networks/depth_decoder.py -> ./log/ResNet/exp1/depth_decoder.py
copy ./train_ResNet.sh -> ./log/ResNet/exp1/train_ResNet.sh
train ResNet
use 49 xy planes, 14 xz planes and 0 yz planes.
use DenseAspp Block
use mixture Lap loss
use plane residual
Training model named:
exp1
Models and tensorboard events files are saved to:
./log/ResNet
Training is using:
cuda
Using split:
eigen_full_left
There are 22600 training items and 1776 validation items
Training
[W reducer.cpp:1303] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
Traceback (most recent call last):
File "/home/bar/projects/PlaneDepth/train.py", line 21, in <module>
trainer.train()
File "/home/bar/projects/PlaneDepth/trainer.py", line 248, in train
self.run_epoch()
File "/home/bar/projects/PlaneDepth/trainer.py", line 300, in run_epoch
losses["loss/total_loss"].backward()
File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/_tensor.py", line 307, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/autograd/__init__.py", line 154, in backward
Variable._execution_engine.run_backward(
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 27588) of binary: /home/bar/miniconda3/envs/planedepth/bin/python
Traceback (most recent call last):
File "/home/bar/miniconda3/envs/planedepth/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.10.1', 'console_scripts', 'torchrun')())
File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/distributed/run.py", line 719, in main
run(args)
File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/bar/miniconda3/envs/planedepth/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-04-24_18:06:29
host : clikaws105
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 27588)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Any other configuration of train.py that I use without --use_mixture_loss runs smoothly. For example, the command below runs fine:
CUDA_VISIBLE_DEVICES=0 torchrun train.py \
--png \
--model_name exp1 \
--use_denseaspp \
--plane_residual \
--flip_right
Can anyone please help me fix this?
Hi, I tested the code on various types of GPUs but did not get the same error. Here are a few things you can try to isolate the problem:
Loss function
- Change `if self.opt.use_mixture_loss` on line 728 of trainer.py to `if False` to rule out the loss calculation.

Warping function
- Change `if self.opt.use_mixture_loss` on line 594 of trainer.py to `if False` to rule out the sigma-dependent probability.
- Change `if self.opt.use_mixture_loss` on line 570 of trainer.py to `if False` to rule out the sigma warping.

The flag in depth_decoder.py also makes a difference, but I don't think it is the main problem.
I hope these help you catch the troublemaker. Please let me know if you find anything new.
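In case it helps with narrowing things down, here is a minimal sketch of what a mixture-of-Laplacians negative log-likelihood typically looks like (illustrative only, not our exact trainer.py code; `mixture_laplace_nll`, `logits`, `mu` and `b` are made-up names):

```python
import torch
import torch.nn.functional as F

def mixture_laplace_nll(target, logits, mu, b, eps=1e-6):
    """Per-pixel negative log-likelihood of a K-component Laplacian mixture.

    target: (B, 1, H, W) observed values
    logits: (B, K, H, W) unnormalized mixture weights
    mu:     (B, K, H, W) component means
    b:      (B, K, H, W) component scales (must be positive)
    """
    log_pi = F.log_softmax(logits, dim=1)                        # log mixture weights
    # log of the Laplace density: -|x - mu| / b - log(2b)
    log_prob = -(target - mu).abs() / (b + eps) - torch.log(2.0 * b + eps)
    # combine the components in log space for numerical stability
    log_mix = torch.logsumexp(log_pi + log_prob, dim=1)
    return -log_mix.mean()
```

A loss of this shape keeps a full-resolution tensor per component alive for the backward pass, so it is naturally heavier on GPU memory than a plain per-pixel loss.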
Thank you, but I was hoping to actually use the mixture loss, not just avoid it.
Yeah, you're right. But maybe these tips can help you find the main issue, which might be just one or two lines of code, and then we can fix it to stop this error.
Could you add the list of the GPUs that you tested the repository on?
I solved it! All I had to do was reduce the batch_size and it runs smoothly. Maybe this issue could be added to a "known issues" section in the readme.md file.
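For anyone who hits this later: cuDNN sometimes surfaces an allocation failure as CUDNN_STATUS_NOT_INITIALIZED, so this error can really be an out-of-memory symptom rather than a broken cuDNN install. A quick way to see how close training is to the limit (a minimal sketch using only standard PyTorch memory-stats calls; `print_gpu_memory` is just a hypothetical helper):

```python
import torch

def print_gpu_memory(tag=""):
    """Print how much GPU memory the caching allocator is using on the current device."""
    device = torch.cuda.current_device()
    allocated = torch.cuda.memory_allocated(device) / 1024**3   # tensors currently alive
    reserved = torch.cuda.memory_reserved(device) / 1024**3     # memory held by the allocator
    total = torch.cuda.get_device_properties(device).total_memory / 1024**3
    print(f"[{tag}] allocated {allocated:.2f} GiB | reserved {reserved:.2f} GiB | total {total:.2f} GiB")

# e.g. call print_gpu_memory("before backward") right before the backward() call in run_epoch
```

If the reserved number sits close to the device total right before backward(), lowering the batch size (as above) or the input resolution is the straightforward fix.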
Awesome! I'm so happy that the problem is solved! I will add it to readme.md
following your suggestion, thank you!