ReLU in residual connections?
ibagur opened this issue
Hi,
I am using part of your code for a particular implementation of a transformer architecture I need for my master's thesis research in RL. I noticed in the original paper (Parisotto et al., 2019) that they re-order the LayerNorms so that they are placed at the input of both the multi-head attention and the feed-forward sub-modules. I saw that you also implement this in your code via the config["layer_norm"] setting.
But the paper also mentions, I quote: "Because the layer norm reordering causes a path where two linear layers are applied in sequence, we apply a ReLU activation to each sub-module output before the residual connection (see Appendix C for equations)." In fact, in those equations they apply a ReLU to the outputs of both the multi-head attention and the feed-forward sub-modules before performing the residual connection. I did not see that specific step in your code (just the standard residual connection), so I wonder whether there is a particular reason for that, or maybe I am missing something (I'm still quite a novice with these implementations). In any case, congratulations on your great work; it is helping me a lot to understand the inner workings of such architectures. Thanks!
Thanks for bringing up this issue. GTrXL is still a work in progress. We will investigate this ReLU detail.
I took some time to train TrXL with pre layer norm plus the missing ReLU activations on Mortar Mayhem Grid and Mystery Path Grid. I did not observe any gains in performance.
```python
# GRU Gate or skip connection
if self.use_gtrxl:
    # Forward GRU gating
    h = self.gate1(query, attention)
else:
    if self.layer_norm == "pre":
        attention = F.relu(attention)
    # Skip connection
    h = attention + query

# GRU Gate or skip connection
if self.use_gtrxl:
    # Forward GRU gating
    out = self.gate2(h, forward)
else:
    if self.layer_norm == "pre":
        forward = F.relu(forward)
    # Skip connection
    out = forward + h
```
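For context, gate1 and gate2 in the snippet above are the GRU-style gating layers used by GTrXL. Roughly, they compute something like the following (a simplified sketch of the gating equations from the paper, not a verbatim copy of the implementation in this repo):

```python
import torch
import torch.nn as nn

class GRUGate(nn.Module):
    """Sketch of the GRU-type gating g(x, y) from Parisotto et al. (2019),
    where x is the skip-connection stream and y is the sub-module output."""

    def __init__(self, dim, bias_init=2.0):
        super().__init__()
        self.Wr = nn.Linear(dim, dim, bias=False)
        self.Ur = nn.Linear(dim, dim, bias=False)
        self.Wz = nn.Linear(dim, dim, bias=False)
        self.Uz = nn.Linear(dim, dim, bias=False)
        self.Wg = nn.Linear(dim, dim, bias=False)
        self.Ug = nn.Linear(dim, dim, bias=False)
        # Positive bias b_g keeps the update gate closed at initialization,
        # so the block starts out close to the identity map.
        self.bg = nn.Parameter(torch.full((dim,), bias_init))

    def forward(self, x, y):
        r = torch.sigmoid(self.Wr(y) + self.Ur(x))
        z = torch.sigmoid(self.Wz(y) + self.Uz(x) - self.bg)
        h_hat = torch.tanh(self.Wg(y) + self.Ug(r * x))
        return (1.0 - z) * x + z * h_hat
```

The ReLU question only concerns the non-gated branch (plain skip connection), so the GTrXL gating path above is unaffected either way.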
Here is a rough plot of the training performance on Mystery Path Grid. Each experiment was run with 4 different seeds.
I'm closing this issue for now. We decided not to add the ReLU activations, and we don't have the time to investigate this further right now.
Hi,
Many thanks for your feedback and the very interesting results. Very much appreciated. I might take a look at it in my own research, but from what I can see, it does not seem to have much effect (apart from adding a bit more computation to the model).