michael-fonder / M4Depth

Official implementation of the network presented in the paper "Parallax Inference for Robust Temporal Monocular Depth Estimation in Unstructured Environments"

Code freezes during validation step while training

dimaxano opened this issue

I ran the following command:

python3 m4depth_pipeline.py --train_datadir=/home/dmitry/datasets/MidAir/pb/train/ --val_datadir='/home/dmitry/datasets/MidAir/pb/test/'  --log_dir=/home/dmitry/Documents/repos/M4Depth/logdir/ --dataset=midair --arch_depth=6 --db_seq_len=8 --seq_len=6 --num_batches=200000 -b=1 -g=1 --summary_interval_secs=120 --save_interval_secs=900 --validation_interval_secs=180 --eval_only_last_pic

With a bit of debugging, I found that the code gets stuck at that line.

Some info about setup:

  • tf 1.15
  • 2080Ti
  • MidAir dataset (RGB + Stereo Disparities)

@michael-fonder Do you have any idea where I should look for the source of the freeze?

Hi @dimaxano

Sorry for the long delay.

The test dataset is quite large, and several minutes are needed to process it completely, so the code may appear to freeze while it is in fact still processing the validation set. I'd like to ask for two more pieces of information before digging into the issue further:

  • How long did you wait before concluding it was a freeze?
  • Is the GPU still active while you experience the freeze?

Hi, @michael-fonder

I didn't measure the time for the whole test set, but I tried removing all test proto samples from the test folder except one and running on that. I am still experiencing freezes lasting several minutes (and I cannot kill the process with a simple Ctrl-C; it just doesn't respond).
And yes, GPU utilization is around zero during validation (checked with nvtop), but the memory is still allocated.

Using a bunch of tf.Print calls, I found that the problem comes from the tf.reduce_mean in eval_func (all the get_* functions inside it). As soon as I commented out the reduce_mean calls and replaced their results with just a tf.constant(1.0), validation goes smoothly (though not very helpfully, hah).
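For reference, a minimal sketch of the kind of instrumentation I used (the function name eval_metric and the err tensor are made-up, not the repo's actual eval_func):

```python
import tensorflow as tf  # TF 1.15, graph mode

def eval_metric(err):
    # Hypothetical metric helper, only to illustrate the debugging approach.
    # tf.Print passes `err` through unchanged but logs the message when the
    # op actually executes; if the message never shows up during validation,
    # the graph is blocked before reaching the reduce_mean below.
    err = tf.Print(err, [tf.shape(err)], message="about to run reduce_mean: ")
    return tf.reduce_mean(err)
```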

Hi @dimaxano ,

Ok, I think the problem is an initialization issue with the variables self.prev_f_pyr and self.prev_d_pyr in estimate_depth when the validation graph is built. I think the solution would be to use different variables for the train and test graphs.
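If it helps, here is a minimal sketch of what I have in mind, assuming the state is held in TF variables (the scope and helper names below are illustrative, not the actual code):

```python
import tensorflow as tf  # TF 1.15, graph mode

def get_prev_state(shape, mode):
    # One set of non-trainable state variables per graph ("train" or "val"),
    # so the validation graph does not silently reuse the training graph's
    # prev_f_pyr / prev_d_pyr state.
    with tf.variable_scope("m4depth_state_%s" % mode, reuse=tf.AUTO_REUSE):
        prev_f_pyr = tf.get_variable("prev_f_pyr", shape=shape,
                                     initializer=tf.zeros_initializer(),
                                     trainable=False)
        prev_d_pyr = tf.get_variable("prev_d_pyr", shape=shape,
                                     initializer=tf.zeros_initializer(),
                                     trainable=False)
    return prev_f_pyr, prev_d_pyr
```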

I'll try to correct and test this carefully as soon as I have some slack time in the next few days.

Agreed, you may be right: when I replace est_resized here and here with gt, validation also goes smoothly.

I'll try to implement your fix and let you know if it works.