LiyuanLucasLiu / Transformer-Clinic

Understanding the Difficulty of Training Transformers

Home Page: https://arxiv.org/abs/2004.08249

is "tmp_weight" in transformer_layer.py useless?

zherowolf opened this issue · comments

Great work!
I have two questions:

  1. Is "tmp_weight" in transformer_layer.py useless? Can I delete it?
  2. In the paper, you said ω_i is fixed during training, while in the code I think it is trainable. Am I right?

Thanks.

Thanks for asking :-)

  1. Yes, it's useless; you can delete it.
  2. I don't remember the paper saying ω is fixed (each layer has the flexibility to adjust ω and rely more on its residual branch); it would be very helpful if you could point me to the part that confused you. In the current implementation, ω is trainable (I don't think there is any reason to make it untrainable; the computation overhead is marginal at most). That said, I did run some experiments with ω fixed, and it led to almost the same performance.
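For readers following along, here is a minimal PyTorch sketch of how a residual-scaling weight ω can be registered either as trainable or fixed. This is an illustrative assumption, not the repo's actual transformer_layer.py; the names `ScaledResidual`, `omega_init`, and `trainable` are hypothetical.

```python
# Minimal sketch (not the repo's transformer_layer.py): a per-dimension
# residual-scaling weight omega that can be trainable or fixed.
import torch
import torch.nn as nn

class ScaledResidual(nn.Module):
    """Computes omega * x + sublayer_out, with one omega entry per hidden dim."""

    def __init__(self, embed_dim: int, omega_init: float = 1.0, trainable: bool = True):
        super().__init__()
        omega = torch.full((embed_dim,), omega_init)
        # Registering omega as an nn.Parameter makes it trainable by default;
        # requires_grad=False fixes it, as in the "omega fixed" ablation.
        self.omega = nn.Parameter(omega, requires_grad=trainable)

    def forward(self, x: torch.Tensor, sublayer_out: torch.Tensor) -> torch.Tensor:
        # Scale the shortcut connection before adding the sublayer output.
        return self.omega * x + sublayer_out

# Usage: wrap the residual connection around an attention or FFN sublayer.
layer = ScaledResidual(embed_dim=512, trainable=True)
x = torch.randn(10, 512)
out = layer(x, torch.randn(10, 512))
```

Fixing ω simply means constructing it with `trainable=False`; the forward pass and the rest of training are unchanged, which is consistent with the marginal overhead mentioned above.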

Thanks for the quick reply and the great work.
I will report experiment results on my dataset later.

Sure, no problem :-)