MIC-DKFZ / medicaldetectiontoolkit

The Medical Detection Toolkit contains 2D + 3D implementations of prevalent object detectors such as Mask R-CNN, Retina Net, Retina U-Net, as well as a training and inference framework focused on dealing with medical images.

Training stuck on validation

delton137 opened this issue · comments

Hi,

I'm trying to train a 3D Mask R-CNN on large CT scans (sizes around 512x512x300).

In my configs I have:

self.pre_crop_size_3D = [64, 64, 64]
self.patch_size_3D = [64, 64, 64]

Relevant validation settings are:

self.val_mode = 'val_sampling'
if self.val_mode == 'val_patient':
    self.max_val_patients = 1  # if None, iterates over the entire val set once.
if self.val_mode == 'val_sampling':
    self.num_val_batches = 1

My training keeps getting stuck on validation: all 32 cores are running, but nothing much seems to be happening. I've been waiting an hour or more. Is it normal for this to take a long time?

I will continue to debug but thought I'd ask here.

Hi, I assume your process is stuck here:

https://github.com/pfjaeger/medicaldetectiontoolkit/blob/5b6bea6884ec91783500b6d57fbd20edcd91576b/utils/dataloader_utils.py#L23

This can happen if your dataset (in your case, I guess, your validation set) is too small and does not contain samples of all classes. You need to exchange the method for creating batches here:

https://github.com/pfjaeger/medicaldetectiontoolkit/blob/5b6bea6884ec91783500b6d57fbd20edcd91576b/experiments/lidc_exp/data_loader.py#L226

Sorry for that, I will fix this in the near future, but I advise you not to wait for my fix.
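
For anyone hitting this: below is a minimal sketch of what a class-balancing sampler of this kind does (simplified, not the toolkit's exact code). If some class never occurs in the set being sampled, the balance condition can never be satisfied and the loop spins forever, which looks exactly like a hang with busy CPU cores.

```python
import numpy as np

def sample_balanced_batch(patient_classes, batch_size, n_classes=2, seed=None):
    """Sketch of a class-balancing sampler: draw candidate batches until
    every class is represented at least once. If patient_classes contains
    no sample of some class, this loop never terminates."""
    rng = np.random.default_rng(seed)
    ids = list(patient_classes)
    while True:
        batch = rng.choice(ids, size=batch_size, replace=True)
        if len({patient_classes[i] for i in batch}) == n_classes:
            return list(batch)

# Works instantly when both classes exist ...
print(sample_balanced_batch({'p0': 0, 'p1': 1, 'p2': 0}, batch_size=2, seed=0))
# ... but sample_balanced_batch({'p0': 0, 'p1': 0}, batch_size=2) would hang.
```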

Thanks for the reply, but I don't think that's the problem, because I only have two classes: background and foreground (0, 1).

Not all images contain segmentation masks (some are just 0 everywhere). I assume that wouldn't cause any problems though.

No, empty masks should be no problem. How many foreground images are in your validation set, and what is your batch size?
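
A quick way to count this, assuming the segmentations are stored as .npy files (the path pattern below is hypothetical, adjust it to your layout):

```python
import glob
import numpy as np

# Hypothetical layout: one *_seg.npy file per validation scan.
seg_paths = glob.glob('/path/to/val_data/*_seg.npy')
n_fg = sum(int(np.load(p).any()) for p in seg_paths)
print('%d of %d validation scans contain foreground' % (n_fg, len(seg_paths)))
```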

I've been running with 90 training scans and 12 validation scans.

I don't know exact numbers, but there are usually ~2 segmentations in each scan (we are trying to detect and segment plaques in the aorta). So, two small segmentations per image, with lots of "empty" space.

For right now I am running with a validation set of only one image. It is slow but manageable for testing purposes. I think some parts of the pipeline are bottlenecked by disk operations (the GPU and CPUs are barely utilized), but it's hard for me to tell.
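
To check the disk-bound suspicion, one option is to time pure batch generation in isolation; a minimal sketch (the commented usage assumes the batch-generator dict the toolkit's data loader returns, which may differ by version):

```python
import time

def time_batches(gen, n=10):
    """Average seconds per batch for a generator, with no GPU work at all.
    If this is close to the full iteration time, data loading is the bottleneck."""
    t0 = time.time()
    for _ in range(n):
        next(gen)
    return (time.time() - t0) / n

# Assumed usage with the toolkit's generators (names may differ by version):
#   batch_gen = data_loader.get_train_generators(cf, logger)
#   print('%.2f s per batch' % time_batches(batch_gen['train']))
```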

I have experimented with batch sizes between 1 and 32. I am hoping to parallelize over 4 GPUs, but right now it's only utilizing one (I've set CUDA_VISIBLE_DEVICES for all 4). I've tested on Titan Z and V100 GPUs, and the V100s seem faster, even though the custom CUDA code you reference was designed for the Titan Z.
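
For what it's worth, setting CUDA_VISIBLE_DEVICES only makes the GPUs visible; the model still has to be wrapped for data parallelism. A generic PyTorch sketch (whether the toolkit's trainer supports this out of the box, I don't know):

```python
import os

# Must be set before torch initializes CUDA.
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'

import torch
import torch.nn as nn

# Stand-in network; in practice this would be the detector the toolkit builds.
net = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=3, padding=1)

if torch.cuda.device_count() > 1:
    # DataParallel splits each input batch across all visible GPUs.
    net = nn.DataParallel(net)
net = net.cuda()
```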

Hi Delton,
were you able to resolve these issues in the meantime?

Yes, I think so, but I don't remember what I did. I can look if you'd like. In any case, the training was very slow, so I stopped working on this and decided to go with a patch-based 3D U-Net for our application.

Alright, thanks for getting back to us. I'll mark this as closed then.


I think this can be a memory problem. I had the same problem, which I solved by reducing n_workers in the config file.
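
For reference, a minimal sketch of that change in the experiment's configs (the exact default value may differ by version):

```python
# In the experiment's configs.py: fewer multiprocessing workers for the
# batch generators means fewer buffered batches held in RAM at once.
self.n_workers = 4  # reduce from the default if validation stalls on memory
```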