LiyuanLucasLiu / Transformer-Clinic

Understanding the Difficulty of Training Transformers

Home Page: https://arxiv.org/abs/2004.08249

Position of residual connection in PreLN architecture is wrong

bilzard opened this issue

In the current implementation, the residual connection for the feed-forward block is taken from after the layer norm [1]:

        x0 = self.maybe_layer_norm(self.self_attn_layer_norm, x0, after=True)
        residual = x0

However, according to the Pre-LN architecture paper [2], the residual variable should be taken before the layer norm, as sketched after the snippet below:

        residual = x0
        x0 = self.maybe_layer_norm(self.self_attn_layer_norm, x0, after=True)
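For context, these two orderings are exactly the standard Post-LN and Pre-LN sub-layer patterns. Here is a minimal sketch of the difference, where `sublayer`, `post_ln_step`, and `pre_ln_step` are illustrative names rather than anything from this repository:

    def post_ln_step(x, sublayer, layer_norm):
        # Post-LN: the norm comes after the residual addition, so the
        # residual is simply the raw sub-layer input.
        return layer_norm(x + sublayer(x))

    def pre_ln_step(x, sublayer, layer_norm):
        # Pre-LN: the norm sits inside the residual branch, so the
        # residual must be captured before normalization.
        residual = x
        return residual + sublayer(layer_norm(x))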

Sorry, this was my misunderstanding. Since this layer norm is only applied in the Post-LN architecture, it isn't a problem:

        x0 = self.maybe_layer_norm(self.self_attn_layer_norm, x0, after=True)
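For reference, fairseq-style code dispatches on a `normalize_before` flag inside `maybe_layer_norm`; the following is a sketch reconstructed from that convention, not a verbatim copy of this repository:

    def maybe_layer_norm(self, layer_norm, x, before=False, after=False):
        # The caller sets exactly one of `before`/`after`.
        assert before ^ after
        # Pre-LN (normalize_before=True): the norm fires on before=True
        # calls and is skipped on after=True calls; Post-LN is the reverse.
        if after ^ self.normalize_before:
            return layer_norm(x)
        else:
            return x

Under Pre-LN, the `after=True` call above returns `x0` unchanged, so `residual = x0` already holds the pre-norm value and the original ordering is correct.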