michael-fonder / M4Depth

Official implementation of the network presented in the paper "Parallax Inference for Robust Temporal Monocular Depth Estimation in Unstructured Environments"

Code freezes during validation step while training

dimaxano opened this issue

I ran the following command:

python3 m4depth_pipeline.py --train_datadir=/home/dmitry/datasets/MidAir/pb/train/ --val_datadir='/home/dmitry/datasets/MidAir/pb/test/'  --log_dir=/home/dmitry/Documents/repos/M4Depth/logdir/ --dataset=midair --arch_depth=6 --db_seq_len=8 --seq_len=6 --num_batches=200000 -b=1 -g=1 --summary_interval_secs=120 --save_interval_secs=900 --validation_interval_secs=180 --eval_only_last_pic

With a bit of debugging, I found that the code gets stuck at that line.

Some info about setup:

  • tf 1.15
  • 2080Ti
  • MidAir dataset (RGB + Stereo Disparities)

@michael-fonder Do you have any idea where I should look for the source of the freeze?

Hi @dimaxano

Sorry for the long delay.

The test dataset is quite large, and several minutes are needed to process it completely, so the code may appear to freeze while it is in fact still processing the validation set. I'd like to ask for two more pieces of information before digging into the issue further:

  • How long did you wait before concluding it was a freeze?
  • Is the GPU still active while you experience the freeze?

Hi, @michael-fonder

I didn't measure the time for the whole test set, but I tried removing all test proto samples from the test folder except one and running on that. I am still experiencing freezes lasting several minutes (and I cannot kill the process with a simple Ctrl-C; it just doesn't respond).
And yes, GPU utilization is around zero during validation (checked with nvtop), but the memory is still allocated.

Using a bunch of tf.Print calls, I found that the problem comes from the tf.reduce_mean in eval_func (all the get_* functions inside it). As soon as I commented out the reduce_mean calls and replaced their results with just a tf.constant(1.0), validation goes smoothly (though not very helpfully, hah).
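For reference, a minimal sketch of the kind of instrumentation I used (the function name eval_metric and the err tensor are made-up, not the repo's actual eval_func):

```python
import tensorflow as tf  # TF 1.15, graph mode

def eval_metric(err):
    # Hypothetical metric helper, only to illustrate the debugging approach.
    # tf.Print passes `err` through unchanged but logs the message when the
    # op actually executes; if the message never shows up during validation,
    # the graph is blocked before reaching the reduce_mean below.
    err = tf.Print(err, [tf.shape(err)], message="about to run reduce_mean: ")
    return tf.reduce_mean(err)
```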

Hi @dimaxano ,

Ok, I think the problem is an initialization issue with the variables self.prev_f_pyr and self.prev_d_pyr in estimate_depth when the validation graph is built. I think the solution would be to use different variables for the train and test graphs.
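If it helps, here is a minimal sketch of what I have in mind, assuming the state is held in TF variables (the scope and helper names below are illustrative, not the actual code):

```python
import tensorflow as tf  # TF 1.15, graph mode

def get_prev_state(shape, mode):
    # One set of non-trainable state variables per graph ("train" or "val"),
    # so the validation graph does not silently reuse the training graph's
    # prev_f_pyr / prev_d_pyr state.
    with tf.variable_scope("m4depth_state_%s" % mode, reuse=tf.AUTO_REUSE):
        prev_f_pyr = tf.get_variable("prev_f_pyr", shape=shape,
                                     initializer=tf.zeros_initializer(),
                                     trainable=False)
        prev_d_pyr = tf.get_variable("prev_d_pyr", shape=shape,
                                     initializer=tf.zeros_initializer(),
                                     trainable=False)
    return prev_f_pyr, prev_d_pyr
```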

I'll try to correct and test this carefully as soon as I have some slack time in the next few days.

Agreed, you may be right: when I replace est_resized here and here with gt, validation also goes smoothly.

I'll try to implement your fix and let you know if it works.