HAN training is unstable

Question

smiler96 opened this issue 4 years ago · comments

When training your HAN model, I found that the loss exploded and the model collapsed! Have you met this? Or can you give me some guidance?

Egqawkq · Answer 1 · Sun Dec 20 2020 16:33:50 GMT+0800 (China Standard Time)

Hi, @smiler96
I used the command provided by the author (with the pre-trained RCAN model) and met the same problem.
So, did you use the pre-trained RCAN.pt? It seems that training the whole model from scratch works fine.
Have you solved this problem? The owner seems to have given up on this repo.

smiler · Answer 2 · Sun Dec 20 2020 16:46:30 GMT+0800 (China Standard Time)

I did not use a pretrained model when I reimplemented HAN; you can find it in my GitHub repo.
I solved this issue with a global residual connection. I think you can try it.

Egqawkq · Answer 3 · Sun Dec 20 2020 16:54:09 GMT+0800 (China Standard Time)

Thanks, @smiler96, but have you trained HAN? What is its final result on the benchmarks? I merged HAN into the EDSR-pytorch repo (because my GPU can't support CUDA 8), and the first 20 epochs have not shown the instability.

HAN uses a long residual connection as well. I want to know the difference between your method and the one the model owner provided, because they look the same except for another long residual connection.

smiler · Answer 4 · Mon Dec 21 2020 10:18:25 GMT+0800 (China Standard Time)

I remember that the first several training epochs of the vanilla HAN were stable, as the figure above shows. But I have not figured out why the loss exploded.
The HAN in my repo is the same apart from the long residual connection.
I have not calculated the PSNR values of each method, sorry about that.

Egqawkq · Answer 5 · Mon Dec 21 2020 11:04:38 GMT+0800 (China Standard Time)

Ok, also thanks for your reply!

Dabing Yu · Answer 6 · Sat Jul 03 2021 21:24:42 GMT+0800 (China Standard Time)

Hi, @smiler96, I also met the instability during training. You said a global residual connection can resolve it. I want to know the difference between your repo and the one the model owner provided; I cannot find the specific operation in your repo.

smiler · Answer 7 · Sun Jul 04 2021 10:33:55 GMT+0800 (China Standard Time)

global_res=True
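
The thread does not show what `global_res=True` does inside the repo, so the following is only a minimal sketch of one common form of global residual in super-resolution: the bicubically upsampled input is added to the network output, so the network only has to learn the high-frequency difference. The names `GlobalResidualSR` and `tiny_body` are hypothetical and are not taken from either repo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalResidualSR(nn.Module):
    """Wrap a super-resolution network with an optional global residual.

    `body` is a hypothetical stand-in for any network (for example HAN) that
    maps a low-resolution image (N, C, H, W) to (N, C, H*scale, W*scale).
    """

    def __init__(self, body: nn.Module, scale: int, global_res: bool = True):
        super().__init__()
        self.body = body
        self.scale = scale
        self.global_res = global_res

    def forward(self, lr: torch.Tensor) -> torch.Tensor:
        out = self.body(lr)
        if self.global_res:
            # Assumed form of the skip: add the bicubically upsampled input,
            # so the body only predicts the residual detail.
            base = F.interpolate(
                lr, scale_factor=self.scale, mode="bicubic", align_corners=False
            )
            out = out + base
        return out


def tiny_body(channels: int, scale: int) -> nn.Module:
    """Toy upsampling network used only to make the sketch runnable."""
    return nn.Sequential(
        nn.Conv2d(channels, channels * scale * scale, kernel_size=3, padding=1),
        nn.PixelShuffle(scale),
    )


model = GlobalResidualSR(tiny_body(3, 2), scale=2, global_res=True)
lr = torch.rand(1, 3, 16, 16)
sr = model(lr)  # shape (1, 3, 32, 32)
```

With this kind of skip, a body whose output is close to zero already produces a plausible image, which is one possible reason such a connection can make early training better behaved.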

Dabing Yu · Answer 8 · Sun Jul 04 2021 11:19:11 GMT+0800 (China Standard Time)

OK, thanks for your reply.

Aatiqa Bint e Ghazali · Answer 9 · Wed Nov 10 2021 13:48:10 GMT+0800 (China Standard Time)

I did not face that issue. Maybe that is because I turned gradient clipping on in the 'options.py' file.

Dannyxu1031 · Answer 10 · Tue Nov 08 2022 09:25:01 GMT+0800 (China Standard Time)

Hi, how did you set '--gclip' in 'options.py'?
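
The thread does not record an answer to this. As a rough sketch only, assuming `--gclip` is a float threshold where 0 disables clipping and a positive value clips each gradient component to that magnitude (an assumption, not checked against this repo's 'options.py'), the option and its effect could look like the following. The helper names `build_parser` and `clip_gradients` are hypothetical; in PyTorch the corresponding call would be `torch.nn.utils.clip_grad_value_`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical option parser with a single gradient clipping flag."""
    parser = argparse.ArgumentParser(description="hypothetical training options")
    # Assumed semantics: 0 disables clipping, a positive value is the threshold.
    parser.add_argument("--gclip", type=float, default=0.0,
                        help="gradient clipping threshold (0 = no clipping)")
    return parser


def clip_gradients(grads, gclip):
    """Clamp every gradient component to the range [-gclip, gclip]."""
    if gclip <= 0:
        return list(grads)  # clipping disabled
    return [max(-gclip, min(gclip, g)) for g in grads]


args = build_parser().parse_args(["--gclip", "0.5"])
exploding = [12.0, -0.2, -7.5, 0.4]
clipped = clip_gradients(exploding, args.gclip)
print(clipped)  # [0.5, -0.2, -0.5, 0.4]
```

Under these assumptions, turning clipping on would mean passing a positive value such as `--gclip 0.5` to the training script, or changing the option's default.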

Xianghui Que · Answer 11 · Thu Feb 16 2023 14:13:10 GMT+0800 (China Standard Time)

Hi, I am looking for the pre-trained RCAN model. Could you do me a favour and tell me how to find it (or just give me a link)? Thank you.

smiler · Answer 12 · Fri Feb 17 2023 09:28:22 GMT+0800 (China Standard Time)

Hi, you can refer to the repo https://github.com/smiler96/Image-Super-Resolution