MIC-DKFZ / medicaldetectiontoolkit

The Medical Detection Toolkit contains 2D + 3D implementations of prevalent object detectors such as Mask R-CNN, Retina Net, Retina U-Net, as well as a training and inference framework focused on dealing with medical images.

Training stuck on validation

delton137 opened this issue · comments

Hi,

I'm trying to train a 3D Mask R-CNN on large CT scans (sizes around 512x512x300).

In my configs I have:

self.pre_crop_size_3D = [64, 64, 64]
self.patch_size_3D = [64, 64, 64]

Relevant validation settings are:

self.val_mode = 'val_sampling'
if self.val_mode == 'val_patient':
    self.max_val_patients = 1  # if None, iterates over the entire val set once.
if self.val_mode == 'val_sampling':
    self.num_val_batches = 1

My training keeps getting stuck on validation: all 32 cores are running, but nothing much seems to be happening. I've been waiting an hour or more. Is it normal for this to take a long time?

I will continue to debug but thought I'd ask here.

Hi, I assume your process is stuck here:

https://github.com/pfjaeger/medicaldetectiontoolkit/blob/5b6bea6884ec91783500b6d57fbd20edcd91576b/utils/dataloader_utils.py#L23

This can happen if your dataset (in your case, I guess, your validation set) is too small and does not contain samples of all classes. You need to exchange the method for creating batches here:

https://github.com/pfjaeger/medicaldetectiontoolkit/blob/5b6bea6884ec91783500b6d57fbd20edcd91576b/experiments/lidc_exp/data_loader.py#L226

Sorry for that, I will fix this in the near future, but I advise you not to wait for my fix.
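
For anyone hitting this: below is a minimal sketch of what a class-balancing sampler of this kind does (simplified, not the toolkit's exact code). If some class never occurs in the set being sampled, the balance condition can never be satisfied and the loop spins forever, which looks exactly like a hang with busy CPU cores.

```python
import numpy as np

def sample_balanced_batch(patient_classes, batch_size, n_classes=2, seed=None):
    """Sketch of a class-balancing sampler: draw candidate batches until
    every class is represented at least once. If patient_classes contains
    no sample of some class, this loop never terminates."""
    rng = np.random.default_rng(seed)
    ids = list(patient_classes)
    while True:
        batch = rng.choice(ids, size=batch_size, replace=True)
        if len({patient_classes[i] for i in batch}) == n_classes:
            return list(batch)

# Works instantly when both classes exist ...
print(sample_balanced_batch({'p0': 0, 'p1': 1, 'p2': 0}, batch_size=2, seed=0))
# ... but sample_balanced_batch({'p0': 0, 'p1': 0}, batch_size=2) would hang.
```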

Thanks for the reply, but I don't think that's the problem, because I only have two classes: background and foreground (0, 1).

Not all images contain segmentation masks (some are just 0 everywhere). I assume that wouldn't cause any problems though.

No, empty masks should be no problem. How many foreground images are in your validation set, and what is your batch size?
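
A quick way to count this, assuming the segmentations are stored as .npy files (the path pattern below is hypothetical, adjust it to your layout):

```python
import glob
import numpy as np

# Hypothetical layout: one *_seg.npy file per validation scan.
seg_paths = glob.glob('/path/to/val_data/*_seg.npy')
n_fg = sum(int(np.load(p).any()) for p in seg_paths)
print('%d of %d validation scans contain foreground' % (n_fg, len(seg_paths)))
```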

I've been running with 90 training scans and 12 validation scans.

I don't know exact numbers, but there are usually ~2 segmentations in each scan (we are trying to detect and segment plaques in the aorta). So, two small segmentations per image, with lots of "empty" space.

For right now I am running with a validation set of only one image. It is slow but manageable for testing purposes. I think some parts of the pipeline are bottlenecked by disk operations (the GPU and CPUs are barely utilized), but it's hard for me to tell.
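
To check the disk-bound suspicion, one option is to time pure batch generation in isolation; a minimal sketch (the commented usage assumes the batch-generator dict the toolkit's data loader returns, which may differ by version):

```python
import time

def time_batches(gen, n=10):
    """Average seconds per batch for a generator, with no GPU work at all.
    If this is close to the full iteration time, data loading is the bottleneck."""
    t0 = time.time()
    for _ in range(n):
        next(gen)
    return (time.time() - t0) / n

# Assumed usage with the toolkit's generators (names may differ by version):
#   batch_gen = data_loader.get_train_generators(cf, logger)
#   print('%.2f s per batch' % time_batches(batch_gen['train']))
```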

I have experimented with batch sizes between 1 and 32. I am hoping to parallelize over 4 GPUs, but right now it's only utilizing one (I've set CUDA_VISIBLE_DEVICES for all 4). I've tested on Titan Z and V100 GPUs, and the V100s seem faster, even though the custom CUDA code you reference was designed for the Titan Z.
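
For what it's worth, setting CUDA_VISIBLE_DEVICES only makes the GPUs visible; the model still has to be wrapped for data parallelism. A generic PyTorch sketch (whether the toolkit's trainer supports this out of the box, I don't know):

```python
import os

# Must be set before torch initializes CUDA.
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'

import torch
import torch.nn as nn

# Stand-in network; in practice this would be the detector the toolkit builds.
net = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=3, padding=1)

if torch.cuda.device_count() > 1:
    # DataParallel splits each input batch across all visible GPUs.
    net = nn.DataParallel(net)
net = net.cuda()
```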

Hi Delton,
were you able to resolve these issues in the meantime?

Yes, I think so, but I don't remember what I did. I can look if you'd like. In any case, the training was very slow, so I stopped working on this and decided to go with a patch-based 3D U-Net for our application.

Alright, thanks for getting back to us. I'll mark this as closed then.


I think this can be a memory problem. I had the same problem, which I solved by reducing n_workers in the config file.
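
For reference, a minimal sketch of that change in the experiment's configs (the exact default value may differ by version):

```python
# In the experiment's configs.py: fewer multiprocessing workers for the
# batch generators means fewer buffered batches held in RAM at once.
self.n_workers = 4  # reduce from the default if validation stalls on memory
```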