Problem while retraining A2J on NYU

Question

Tekicho opened this issue 3 years ago

Hello,
Thank you for your great work! While trying to reproduce A2J by retraining on NYU, I am getting "inf" for the regression loss. My environment is:
Windows 10, CUDA 10.1, cuDNN 7.6.1, PyTorch 1.5.1.
I am using the same hyperparameters as in nyu.py. It seems that you are fine-tuning a pretrained ResNet-50 with none of its weights frozen, am I right? Your help is greatly appreciated!
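
One way to narrow down where an "inf" first appears is to check every loss term on each step and stop as soon as one is non-finite. The sketch below is only an illustration: check_finite and the loss names in the usage note are hypothetical and not part of the A2J code.

```python
import math


def check_finite(step, **losses):
    """Raise as soon as any named loss value is inf or NaN.

    `losses` maps a label to a plain Python float, e.g. the result of
    calling .item() on a scalar loss tensor.
    """
    for name, value in losses.items():
        if not math.isfinite(value):
            raise FloatingPointError(
                f"step {step}: {name} loss is {value}"
            )
```

Inside a training loop this could be called as check_finite(step, regression=reg_loss.item(), classification=cls_loss.item()), where reg_loss and cls_loss stand for whatever scalar tensors the loop computes.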

Boshen Zhang · Answer 1 · Mon Dec 20 2021 16:52:01 GMT+0800 (China Standard Time)

Hi @Tekicho, yes, we use an ImageNet-pretrained ResNet-50, as in our training code. I am also confused about why you get an inf error, because this error did not happen when we trained the model.

Tekicho · Answer 2 · Mon Dec 27 2021 17:40:25 GMT+0800 (China Standard Time)

There seems to be a logical error at line 151 in src_train/anchor.py:
regression_loss += regression_diff_depth.mean()
I think it should be:

regression_loss += regression_loss_depth.mean()

@zhangboshen, could you please share the latest version of anchor.py and nyu.py used for training?
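
To illustrate why the two variables would not be interchangeable, here is a minimal sketch. It assumes, going only by the variable names, that regression_loss_depth is a smooth-L1 transform of the absolute difference regression_diff_depth; the surrounding code in anchor.py is not shown in this thread, so this may not match it. The sample values are made up.

```python
def smooth_l1(diff, beta=1.0):
    """Smooth-L1 penalty for one absolute difference `diff` (>= 0)."""
    if diff <= beta:
        return 0.5 * diff * diff / beta  # quadratic near zero
    return diff - 0.5 * beta             # linear for large errors


def mean(values):
    return sum(values) / len(values)


# Hypothetical absolute depth errors, standing in for regression_diff_depth.
diffs = [0.2, 0.5, 3.0]
# Assumed counterpart of regression_loss_depth.
losses = [smooth_l1(d) for d in diffs]

raw = mean(diffs)          # what accumulating the raw differences adds
penalised = mean(losses)   # what accumulating the transformed losses adds
print(raw, penalised)
```

Under that assumption the two means differ, so accumulating the raw difference would change the effective depth term of the loss.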

Tekicho · Answer 3 · Wed Mar 09 2022 16:45:50 GMT+0800 (China Standard Time)

Finally solved the problem by setting:
torch.backends.cudnn.enabled = False
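
For anyone hitting the same issue, a minimal sketch of where this setting could go, assuming it is placed near the top of the training script before the model is built and before any batch is run:

```python
import torch

# Turn cuDNN off globally so convolutions fall back to PyTorch's
# native CUDA kernels. Set this before creating the model.
torch.backends.cudnn.enabled = False
```

Note that this applies to the whole process and will usually make training slower.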