naver / r2d2

Getting NaN at reliability loss occasionally during training

GabbySuwichaya opened this issue · comments

commented

Could you please suggest how to solve this problem and explain why it happens?

During training, I occasionally get NaN for the reliability loss. This happens more often when the batch size is set to a small number such as 1 or 2 (threads = 1).
Here, I have also attached a screenshot of when it happens:
NaN_problem

I have used the default settings in train.py, except that batch size = 1. My computer does not have enough GPU memory when batch size > 4.

Initially, I suspected that this problem happens due to a lack of corresponding pixels between the two images. Therefore, I tried to skip any sample that causes the NaN loss by forcing MyTrainer.forward_backward() in train.py to return before calling loss.backward(), as shown in the captured screen below, adding a continue and print(details) in the if-condition at Line 55, and continuing training until it finished 25 epochs.

Mytrainer
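
(For clarity, here is a rough sketch of the modification described above; it is not the exact code in the screenshot, just the default MyTrainer.forward_backward() from train.py with a NaN check and an early return added.)

class MyTrainer(trainer.Trainer):
    def forward_backward(self, inputs):
        output = self.net(imgs=[inputs.pop('img1'), inputs.pop('img2')])
        allvars = dict(inputs, **output)
        loss, details = self.loss_func(**allvars)
        if torch.isnan(loss):            # e.g. no valid correspondence in this pair
            print(details)
            return loss, details         # skip loss.backward() for this sample
        if torch.is_grad_enabled():
            loss.backward()
        return loss, details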

However, my trained R2D2 WASF N16 has a drop in MMA performance, as shown in the plot attached below.
Here, I have used the default settings with WASF N16 and min_size = 256, max_size = 1024.

The performance of my trained model is denoted as R2D2 WASF N16 (Self-trained), and
it is compared with the downloaded pretrained R2D2_WASF_N16.pt using the same feature extraction settings.

But the performance drop is quite obvious: the MMA @ 3px is 0.67 instead of 0.72 +/- 0.01.

hseq

I'm not sure what the cause of the error is in your case.

Anyway, what you can do if you lack memory is simply to accumulate the gradient over each individual image:

class MyTrainer(trainer.Trainer):
    """ This class implements the network training.
        Below is the function I need to overload to explain how to do the backprop.
    """
    def forward_backward(self, inputs):
        img1, img2 = inputs.pop('img1'), inputs.pop('img2')
        batch_size = len(img1)
        sum_loss = 0
        sum_details = {}
        for i in range(batch_size):
            sl = slice(i, i+1)  # select a single image pair
            output = self.net(imgs=[img1[sl], img2[sl]])
            # slice the remaining batch inputs (e.g. aflow) to this pair;
            # the network outputs already correspond to the selected pair
            allvars = dict({k: v[sl] for k, v in inputs.items()}, **output)
            loss, details = self.loss_func(**allvars)
            if torch.is_grad_enabled(): loss.backward()
            sum_loss += loss
            sum_details = {k: v + sum_details.get(k, 0) for k, v in details.items()}
        return sum_loss, sum_details

Without any other modification of the code, this should lead to exactly the same results as the original code (yet using much less memory). (Disclaimer: I didn't actually execute this code, but in principle it should work.)

Also, I noticed one weird thing in your screenshot: the UserWarning about non-writeable NumPy arrays. I don't remember seeing this in my case, and it may be related to your error.

commented

Thanks for the quick reply and the information.
I will try your suggested code and will give an update once I am able to solve the present issue.

About the present issue, I see what you mean...

  • Firstly, the UserWarning
    I was going to ask you as well whether you have ever seen this warning before... It seems that I started receiving it after updating to CUDA 10.2 and PyTorch 1.5 (but I cannot confirm this yet).

But as you mentioned... I also tried to debug where the problem is. It seems to me that it happens in relation to the for-loop over tqdm(self.loader).

Therefore, I think that this problem is more related to the conversion between PIL, NumPy, and PyTorch while preparing the data, somewhere between CatPairDataset and PairLoader...

However, yes, this could be the root of the problem of why I get the NaN reliability loss... which I am going to talk about next.

  • The NaN reliability loss problem.

I found that the problem, which happens more often with batch_size = 1, is caused by msk at line 41 in reliability_loss being all False.

So, I checked further in self.sampler, which directed me to NghSampler2. It seems that this is because (mask == False).all() holds at L337.

I found that this is both because "aflow" is all NaN and because, after assigning b1, x1, y1, aflow[b1, :, y1, x1] is also all NaN.

So, either the value of aflow or the assigned values of b1, x1, y1 could be the problem, but I am more inclined to think that the value of aflow is the problem.

  • So, I have the following questions to ask:
  1. Could you please advise me where in tools/dataloader.py, pair_dataset.py, or dataset.py you actually load the image into PIL and assign and/or convert aflow between NumPy and PyTorch?

  2. This is to ask for confirmation. Do the settings of NghSampler2() need to be changed for different values of batch_size?
    For example, at the moment subq = -8 according to Line 50 in train.py, which means that x1, y1 are chosen randomly and may fall outside the non-NaN area of aflow... I believe this is unlikely, but it would be great to have your feedback.

Here is also a captured screen where (mask == False).all() holds and "aflow" is all NaN, followed by a small toy snippet that reproduces the situation:
aflow_all_NaN_screen
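
(Standalone sketch, not the actual NghSampler2 code; the shapes are made up. It only shows that an all-NaN aflow makes every sampled correspondence NaN, so the mask ends up all False.)

import torch

B, H, W = 1, 64, 64
aflow = torch.full((B, 2, H, W), float('nan'))   # all-NaN flow, as in my case

b1 = torch.zeros(100, dtype=torch.long)          # 100 randomly chosen query pixels
y1 = torch.randint(H, (100,))
x1 = torch.randint(W, (100,))

xy2 = aflow[b1, :, y1, x1]                       # shape (100, 2), all NaN
mask = torch.isfinite(xy2).all(dim=1)            # all False
print((mask == False).all())                     # tensor(True)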

commented

Update! I found where in the scripts the UserWarning comes from...

  • It is caused by converting non-writeable NumPy arrays to torch tensors in tools/dataloader.py...

  • Why does this become a problem?
    It seems that the warning is related to torchvision (pytorch/vision#2194).

  • I temporarily worked around this warning by using "np.array" instead of "np.asarray" at
    Lines 217-218,
    Line 232, and
    by adding mask = np.array(mask) after Line 76.

This somewhat follows the suggestion in https://stackoverflow.com/questions/39554660/np-arrays-being-immutable-assignment-destination-is-read-only, as illustrated by the small snippet below.
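
(Minimal standalone illustration, assuming PyTorch >= 1.5 and a recent Pillow; this is not the actual dataloader code.)

import numpy as np
import torch
from PIL import Image

img = Image.new('RGB', (8, 8))

a = np.asarray(img)        # may be a read-only view (a.flags.writeable == False)
t1 = torch.from_numpy(a)   # -> UserWarning about non-writeable arrays

b = np.array(img)          # makes a writeable copy of the pixel data
t2 = torch.from_numpy(b)   # no warning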

  • It would be great if you could confirm this solution.

  • Also, the NaN problem seems to still exist regardless of the warning fix...

So, it would be great if you could kindly provide some answers to my previous questions.

Hi @GabbySuwichaya

Ok, I see the problem much better now. The NaN is indeed due to the fact that, sometimes, one of the training pairs does not contain a single valid pixel.

When training with batch_size=8 (the default), it is almost impossible that all image pairs in a batch are simultaneously invalid, and since the loss is computed as an average over all valid pixels at the batch level, the problem never shows up.

However, in your case, with batch_size=1 the problem will of course happen quite often. The easy solution would be to fix line 41 like this:

loss = pixel_loss[msk].sum() / (1 + msk.sum())

Note that it will not be exactly equal to the original loss, which gave an equal weight to each valid pixel over the entire batch, whereas now each valid pixel gets a weight that depends on the number of valid pixels per image.
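
To see why the +1 avoids the NaN (a toy example, not the actual loss code): with an all-False mask, the sum of the valid pixel losses is 0 and msk.sum() is 0, so a plain normalization gives 0/0 = NaN, while dividing by 1 + msk.sum() gives 0 for such a pair:

import torch

pixel_loss = torch.rand(1, 1000)                  # per-pixel loss values
msk = torch.zeros(1, 1000, dtype=torch.bool)      # not a single valid pixel

print(pixel_loss[msk].sum() / msk.sum())          # tensor(nan)
print(pixel_loss[msk].sum() / (1 + msk.sum()))    # tensor(0.)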

Another solution would be to ensure that the pair loader never returns image pairs with 'all-invalid' pixels :)
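
Such a check is not in the released code, but a hypothetical sketch of what it could look like, wrapped around the pair loader, would be something like:

import numpy as np

def get_valid_pair(pair_loader, idx, max_tries=10):
    # resample until the pair has at least one finite (non-NaN) correspondence
    for _ in range(max_tries):
        pair = pair_loader[idx]
        if np.isfinite(pair['aflow']).any():
            return pair
        idx = np.random.randint(len(pair_loader))  # try another random pair
    raise RuntimeError('no pair with valid correspondences found')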

commented

I see. Thank you very much, @jerome-revaud !

Also, I am not sure if this is too late to say ....
Thank you so much for releasing this awesome work and this awesome package on GitHub. :)

I think I got all the answers, so this issue can be closed.

@GabbySuwichaya have you resolved the problem of the MMA@3 drop? I also observe the same drop, even with batch_size 8, and cannot figure out why.

commented

@FangGet ... I resolved my problem by using a batch size of 4. Also, I found that it is unnecessary to remove the warning. I get slightly better stats with a batch size of 4 than with 8 (here, I use N=16). This could be because of my GPU. Also, my problem occurred on an older version of R2D2 (commit b23a2c4f4f608, "adding MMA plots", Jan 28).

I am not sure if your problem comes from the same cause. However, it is probably best to check whether you get a NaN loss at APLoss by any chance, and whether that is because "aflow" is all NaN. The problem I had was caused by the NaN aflow.

OK, I will check it, thank you.