Difference between the implementation and the original paper
wade3han opened this issue:
Hello, I really liked your paper and am trying to reproduce its results. Meanwhile, I am curious about the implementation of ADMIN.
- On your implemented code (Transformer-Clinic/fairseq/fairseq/modules/transformer_layer.py, lines 170 to 178 in 60abd66):
- Furthermore, in Transformer-Clinic/fairseq/fairseq/modules/transformer_layer.py, lines 176 to 177 in 60abd66:
Thanks for reaching out and glad you enjoy our paper : )
- The variance is indeed a scalar. I will add a note to the paper to emphasize this. It would also help if you could share some thoughts on the part that confused you.
Specifically, we assume every element of the input follows the same distribution, and every element of the same parameter follows the same distribution (and they are i.i.d.), which means every dimension of the output also follows the same distribution. Thus it is sufficient to use a one-dimensional distribution to characterize the model at initialization (the sketch right after this list checks this empirically).
You can find similar settings in "Understanding the Difficulty of Training Deep Feedforward Neural Networks" and "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification".
- At initialization, the $f_j(x_{j-1})$ are independent of each other. Therefore, $\sum_{j<i} Var[f_j(\cdot)] = Var[\sum_{j<i-1} f_j(\cdot)] + Var[f_{i-1}(\cdot)]$. The `output_std` here is used to calculate $Var[f_{i-1}(\cdot)]$, and `input_std` gives $Var[\sum_{j<i-1} f_j(\cdot)]$ (`input_std` is the variance of `residual * self.attention_ratio_change`). The sketch below also checks this additivity numerically.
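To make both points above concrete, here is a small, self-contained numpy sketch (my own illustration, not code from the repo; the dimensions, sample counts, and toy scales are arbitrary): with i.i.d. inputs and i.i.d. Gaussian weights every output dimension has roughly the same variance, so a single scalar characterizes it, and for independent branch outputs the variance of the sum equals the sum of the variances.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 256, 20_000            # feature dimension, number of samples (toy values)

# i.i.d. input and i.i.d. Gaussian weights: every output dimension follows
# the same distribution, so one scalar variance characterizes the output.
x = rng.normal(size=(n, d))
W = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))
y = x @ W
per_dim_var = y.var(axis=0)
print(per_dim_var.min(), per_dim_var.max())   # both close to 1.0

# independent branch outputs: Var[sum_j f_j] = sum_j Var[f_j]
f1 = rng.normal(scale=0.7, size=(n, d))
f2 = rng.normal(scale=1.3, size=(n, d))
print((f1 + f2).var(), f1.var() + f2.var())   # nearly identical (~2.18)
```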
I'm sorry that this part of the implementation is confusing. For the initialization calculation, it is mostly equivalent to the following (assuming all residuals are normalized to unit variance):
```python
global encoder_ratio, tmp_file
# index of this residual branch in the profiling log (two branches per layer)
tmp_layer_ind = self.layer_num * 2 + 1
# record the accumulated ratio (std of the residual branch so far)
tmp_file.write('{} {}\n'.format(tmp_layer_ind, encoder_ratio))
# set omega_i (attention_ratio_change) to the accumulated ratio
self.attention_ratio_change.data.fill_(encoder_ratio)
# despite the name, output_std is the *variance* of this sub-layer's output f_{i-1}(.)
output_std = np.var(x.clone().cpu().float().data.contiguous().view(-1).numpy())
# accumulate: Var[sum_{j<i} f_j] = Var[sum_{j<i-1} f_j] + Var[f_{i-1}]
encoder_ratio = np.sqrt(encoder_ratio**2 + output_std)
```
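For completeness, here is a hypothetical, self-contained sketch of the same accumulation outside fairseq (not the repo's code; the toy `branches` modules, shapes, and the use of `layer_norm` as a stand-in for the post-LN normalization are my assumptions). It walks a stack of sub-layers at initialization, measures each branch's output variance, and accumulates it exactly as `encoder_ratio` is accumulated above, producing one omega per branch.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d_model, n_branches, n_tokens = 256, 6, 4096

# toy stand-ins for the residual branches f_1, ..., f_n at initialization
branches = [nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                          nn.Linear(d_model, d_model)) for _ in range(n_branches)]

acc = torch.randn(n_tokens, d_model)  # stand-in for the (normalized) embedding
encoder_ratio = 1.0                   # std of everything accumulated so far
omegas = []
with torch.no_grad():
    for f in branches:
        omegas.append(encoder_ratio)                # omega_i <- accumulated std so far
        out = f(F.layer_norm(acc, (d_model,)))      # f_i applied to a normalized input
        output_var = out.reshape(-1).numpy().var()  # scalar Var[f_i(.)]
        encoder_ratio = float(np.sqrt(encoder_ratio**2 + output_var))
        acc = acc + out                             # accumulate the residual stream
print([round(w, 3) for w in omegas])
```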
In my implementation, the reason I chose `input_std` is that it allows me to add some additional sanity checks and visualizations related to `input_std` (e.g., checking whether `input_std + output_std = Var[x + residual * self.attention_ratio_change]`).
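That sanity check can be reproduced in isolation. The numpy sketch below is my own toy example (the shapes and the value of `attention_ratio_change` are made up); it verifies that when the two branches are independent, `input_std + output_std` matches the variance of the actual sum up to sampling noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model = 4096, 256
attention_ratio_change = 1.4          # toy value for omega_i

residual = rng.normal(size=(n_tokens, d_model))          # accumulated residual, unit variance
x = rng.normal(scale=0.8, size=(n_tokens, d_model))      # stand-in for the sub-layer output

input_std = np.var(residual * attention_ratio_change)    # variance of the scaled residual
output_std = np.var(x)                                   # variance of the sub-layer output
combined = np.var(x + residual * attention_ratio_change) # variance of the actual sum

# with independent branches these agree up to sampling noise (~2.60 vs ~2.60)
print(input_std + output_std, combined)
```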