LiyuanLucasLiu / Transformer-Clinic

Understanding the Difficulty of Training Transformers

Home Page: https://arxiv.org/abs/2004.08249


Difference of implementation from the original paper

wade3han opened this issue · comments

Hello, I really liked your paper and am trying to reproduce the results. Meanwhile, I am curious about the implementation of ADMIN.

  1. In your implementation (

    global encoder_ratio, tmp_file
    tmp_layer_ind = self.layer_num * 2 + 1
    tmp_ratio = encoder_ratio
    tmp_file.write('{} {}\n'.format(tmp_layer_ind, tmp_ratio))
    self.attention_ratio_change.data.fill_(tmp_ratio)
    print ('encoder attn ratio: {}'.format(tmp_ratio))
    input_std = np.var((residual * self.attention_ratio_change).clone().cpu().float().data.contiguous().view(-1).numpy())
    output_std = np.var(x.clone().cpu().float().data.contiguous().view(-1).numpy())
    encoder_ratio = np.sqrt(input_std + output_std)
    ), the computed variance is a scalar. However, the paper says it is a $D$-dimensional vector, so there seems to be a mismatch. Is it okay to use a scalar?

  2. Furthermore,

    input_std = np.var((residual * self.attention_ratio_change).clone().cpu().float().data.contiguous().view(-1).numpy())
    output_std = np.var(x.clone().cpu().float().data.contiguous().view(-1).numpy())
    show that the code uses the variance of both the input and the output, which is quite different from the original paper. The calculation of $w_i$ at the initialization stage also seems to differ.

Thanks for reaching out and glad you enjoy our paper : )

  1. The variance is indeed a scalar. I will add a note to the paper to emphasize this. It would also help if you could share some thoughts on the part that confuses you.

Specifically, we assume every element of the input follows the same distribution, and every element of the same parameter follows the same distribution (and they are i.i.d.), which means every dimension of the output would also follow the same distribution. Thus it is sufficient to use a one-dimensional distribution to characterize the model at initialization.
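
For intuition, here is a minimal NumPy sketch (my own illustration, not code from the repo) showing that, under these i.i.d. assumptions, every entry of the per-dimension variance vector matches the scalar variance of the flattened output at initialization:

    import numpy as np

    rng = np.random.default_rng(0)
    batch, d = 4096, 512

    # i.i.d. standard-normal input and Xavier-style i.i.d. weights, as at initialization.
    x = rng.normal(0.0, 1.0, size=(batch, d))
    w = rng.normal(0.0, np.sqrt(1.0 / d), size=(d, d))
    y = x @ w

    per_dim_var = y.var(axis=0)  # the D-dimensional variance vector
    scalar_var = y.var()         # scalar variance over the flattened output

    print(per_dim_var.min(), per_dim_var.max())  # every entry is close to 1.0
    print(scalar_var)                            # also close to 1.0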

You can find similar settings in "Understanding the Difficulty of Training Deep Feedforward Neural Networks" and "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification".

  2. At initialization, the outputs $f_j(x_{j-1})$ are independent of each other. Therefore, $\sum_{j<i} Var[f_j(\cdot)] = Var[\sum_{j<i-1} f_j(\cdot)] + Var[f_{i-1}(\cdot)]$. The output_std here is used to estimate $Var[f_{i-1}(\cdot)]$, and input_std to estimate $Var[\sum_{j<i-1} f_j(\cdot)]$ (input_std is the variance of residual * self.attention_ratio_change).
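
As a quick numeric check of this decomposition (again a sketch of my own, not repo code): for independent arrays, the variance of the sum equals the sum of the variances, which is exactly what the running encoder_ratio update accumulates:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000_000

    # Two independent branches: prev_sum stands in for sum_{j<i-1} f_j(.)
    # (i.e., residual * self.attention_ratio_change) and new_out for f_{i-1}(.) (i.e., x).
    prev_sum = rng.normal(0.0, 1.5, size=n)
    new_out = rng.normal(0.0, 0.8, size=n)

    input_std = np.var(prev_sum)    # ~ 2.25
    output_std = np.var(new_out)    # ~ 0.64

    print(np.var(prev_sum + new_out))        # ~ 2.89, matches the sum below
    print(input_std + output_std)            # ~ 2.89
    print(np.sqrt(input_std + output_std))   # the updated encoder_ratio, ~ 1.70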

I'm sorry that this part of the implementation is confusing. For the initialization calculation, it is mostly equivalent to the following version, since all residuals are normalized to $Var[\cdot] = 1$ by LayerNorm, except for the first layer (both versions work well):

global encoder_ratio, tmp_file
tmp_layer_ind = self.layer_num * 2 + 1
tmp_file.write('{} {}\n'.format(tmp_layer_ind, encoder_ratio))
self.attention_ratio_change.data.fill_(encoder_ratio)
output_std = np.var(x.clone().cpu().float().data.contiguous().view(-1).numpy())
encoder_ratio = np.sqrt(encoder_ratio**2 + output_std)

In my implementation, the reason I chose input_std is that it allows me to add some additional sanity checks and visualizations related to input_std (e.g., whether input_std + output_std ≈ Var[x + residual * self.attention_ratio_change]).
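
For completeness, here is a hedged sketch of that sanity check with hypothetical helper names (the actual hook lives inside the repo's encoder layer):

    import numpy as np
    import torch

    def flat_var(t):
        # Scalar variance over the flattened tensor, matching the repo's style.
        return np.var(t.clone().cpu().float().data.contiguous().view(-1).numpy())

    def variance_sanity_check(x, residual, ratio):
        # If the two branches are independent, their variances should add up.
        input_std = flat_var(residual * ratio)
        output_std = flat_var(x)
        combined = flat_var(x + residual * ratio)
        print('input + output: {:.4f}, combined: {:.4f}'.format(input_std + output_std, combined))

    # Hypothetical usage with random tensors standing in for layer activations:
    x = torch.randn(2048, 512)
    residual = torch.randn(2048, 512)
    variance_sanity_check(x, residual, 1.0)  # prints two values close to 2.0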