Question
song-cc opened this issue
When I run train.py, the loss becomes 'nan' after about 2 epochs. Did you have this problem when you trained? I want to know why this happens and how to solve it.
Hi,
I have not encountered this problem before. Did you check that the datasets are normalized to the range -1 to 1 and are being passed to the models properly?
You can check the .npz file by opening it and printing one row of the values.
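A quick way to do that check (a minimal sketch; the file name and array keys depend on your own preprocessing script, so treat them as placeholders):

```python
import numpy as np

def npz_value_ranges(path):
    """Return {array_name: (min, max)} for every array stored in an .npz file.

    Handy for confirming that training patches were actually normalized
    to the [-1, 1] range before being fed to the model.
    """
    with np.load(path) as data:
        return {key: (float(data[key].min()), float(data[key].max()))
                for key in data.files}

# Example (hypothetical file name from your own preprocessing step):
# print(npz_value_ranges("DRIVE_train.npz"))
```

If any array's range falls outside [-1, 1] (e.g. raw 0..255 pixel values), the normalization step is the first thing to fix.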
It can also be a problem with the version of tf/keras.
Just to be sure, I will try to run train.py again and report the output here.
Thanks
I am sure that my TensorFlow and Keras versions are 2.0.0 and 2.3.1, and that the values in the .npz file are between -1 and 1. But I still get the same problem.
Your problem is a classic case of gradient explosion. There are plenty of GitHub issues and StackExchange discussions on this.
Most of them mention properly normalizing the dataset (-1.0 to 1.0 in our case) and converting it to float values.
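For reference, a minimal sketch of that normalization step (assuming uint8 images in 0..255; the helper name is mine, not from the repository):

```python
import numpy as np

def to_unit_range(imgs):
    """Convert uint8 image data (0..255) to float32 in [-1, 1].

    0 maps to -1.0, 255 maps to 1.0; the output dtype is float32, which is
    what the model expects as input.
    """
    return imgs.astype("float32") / 127.5 - 1.0
```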
Can you mention which dataset you used: DRIVE, CHASE, or STARE? Also, can you provide the .npz file as a Bitbucket, Dropbox, or Drive link here? I want to try it with my code.
Thanks
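If the data really is normalized and the loss still blows up, gradient clipping is a standard mitigation; in Keras you can pass `clipnorm` or `clipvalue` to the optimizer. The underlying idea, sketched in plain NumPy (this mirrors what `tf.clip_by_global_norm` does, and is not code from the repository):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm.

    If the global norm already fits under max_norm, the gradients are returned
    unchanged; otherwise every array is scaled down by the same factor, so the
    update direction is preserved while its magnitude is capped.
    """
    global_norm = np.sqrt(sum(float(np.sum(np.square(g))) for g in grads))
    if global_norm <= max_norm:
        return [g.copy() for g in grads]
    scale = max_norm / global_norm
    return [g * scale for g in grads]
```

Capping the update magnitude this way often turns a run that diverges to NaN into one that merely trains slowly, which helps confirm exploding gradients as the cause.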
I use the DRIVE dataset, but the .npz file is large.
Could you email me? I will email you the .npz file.
My email is songchongchongde@163.com
Thanks very much.
If you cannot download it from https://pan.baidu.com/s/1I8CNRCq_1WZ-TrAO83Nyzg (extraction code: driv), please email me.
Please use this .npz file and train with the Python file given in the repository. Also post the results here.
Thanks
https://drive.google.com/file/d/1j1BErDnxJIjA3VEgrr66tXQFyW-NTzDr/view?usp=sharing
Thanks very much.
It would be better if you could link the original DRIVE data images.
I want to know why we got this problem.
It's already given in the README.md file.