PengBoXiangShang / multigraph_transformer

IEEE TNNLS 2021, transformer, multi-graph transformer, graph, graph classification, sketch recognition, sketch classification, free-hand sketch, official code of the paper "Multi-Graph Transformer for Free-Hand Sketch Recognition"

Transformer training problem

lonelygoatherd opened this issue · comments

Hi, are there any other tricks for training the Transformer used in your work? I have used the model to train on another graph task, but the "Transformer + BatchNorm + Softmax" setup never seems to converge.

Thank you for your interest. All details for training are exactly the same as in the paper associated with this repository; there are no additional tricks.

That being said, if you are working on another graph task that is not sketch-based, it is normal that the exact architecture we proposed may not give the best performance. You will have to experiment with the architecture design, such as the choice of hidden dimension, number of layers, normalization scheme, etc.
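To illustrate that last point, below is a minimal PyTorch sketch, not taken from this repository, of an encoder layer whose normalization scheme can be switched between BatchNorm and LayerNorm when debugging convergence on a new graph task. The class name `ConfigurableEncoderLayer` and all hyper-parameter values are placeholders, not the settings used in the paper.

```python
# Minimal sketch (assumed, not from this repo) of a transformer-style encoder
# layer with a configurable normalization scheme, useful for comparing
# BatchNorm against LayerNorm when training does not converge.
import torch
import torch.nn as nn


class ConfigurableEncoderLayer(nn.Module):
    def __init__(self, d_model=128, n_heads=8, d_ff=512, norm="layer", dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        # BatchNorm normalizes each feature over the batch/node axis, LayerNorm
        # over the feature axis; switching between them is one of the knobs
        # worth trying when training stalls on a new task.
        if norm == "batch":
            self.norm1 = nn.BatchNorm1d(d_model)
            self.norm2 = nn.BatchNorm1d(d_model)
        else:
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
        self.norm_type = norm
        self.dropout = nn.Dropout(dropout)

    def _apply_norm(self, norm, x):
        # BatchNorm1d expects (batch, channels, length); LayerNorm expects (..., features).
        if self.norm_type == "batch":
            return norm(x.transpose(1, 2)).transpose(1, 2)
        return norm(x)

    def forward(self, x, attn_mask=None):
        h, _ = self.attn(x, x, x, attn_mask=attn_mask)
        x = self._apply_norm(self.norm1, x + self.dropout(h))
        h = self.ff(x)
        x = self._apply_norm(self.norm2, x + self.dropout(h))
        return x


# Quick smoke test on random node features shaped (batch, nodes, d_model).
layer = ConfigurableEncoderLayer(norm="layer")
out = layer(torch.randn(4, 100, 128))
print(out.shape)  # torch.Size([4, 100, 128])
```

The same pattern extends to the other knobs mentioned above (hidden dimension, number of stacked layers), since they are just constructor arguments in a sketch like this.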