LiyuanLucasLiu / Transformer-Clinic

Understanding the Difficulty of Training Transformers

Home Page: https://arxiv.org/abs/2004.08249


Difference of implementation from the original paper

wade3han opened this issue · comments

Hello, I really liked your paper and am trying to reproduce the results. Meanwhile, I am curious about the implementation of ADMIN.

  1. In your implementation (

    global encoder_ratio, tmp_file
    tmp_layer_ind = self.layer_num * 2 + 1
    tmp_ratio = encoder_ratio
    tmp_file.write('{} {}\n'.format(tmp_layer_ind, tmp_ratio))
    self.attention_ratio_change.data.fill_(tmp_ratio)
    print ('encoder attn ratio: {}'.format(tmp_ratio))
    input_std = np.var((residual * self.attention_ratio_change).clone().cpu().float().data.contiguous().view(-1).numpy())
    output_std = np.var(x.clone().cpu().float().data.contiguous().view(-1).numpy())
    encoder_ratio = np.sqrt(input_std + output_std)
    ), the computed variance is a scalar. However, the paper says it is a $D$-dimensional vector, so there seems to be a mismatch. Is it okay to use a scalar?

  2. Furthermore,

    input_std = np.var((residual * self.attention_ratio_change).clone().cpu().float().data.contiguous().view(-1).numpy())
    output_std = np.var(x.clone().cpu().float().data.contiguous().view(-1).numpy())
    show that the code uses the variance of both the input and the output, which is quite different from the original paper. The calculation of $w_i$ at the initialization stage also seems to differ.

Thanks for reaching out and glad you enjoy our paper : )

  1. The variance is indeed a scalar. I will add a note to the paper to emphasize this. It would also help if you could share some thoughts on the part that confuses you.

Specifically, we assume every element of the input follows the same distribution, and every element of the same parameter follows the same distribution (and they are i.i.d.), which means every dimension of the output would also follow the same distribution. Thus it is sufficient to use a one-dimensional distribution to characterize the model at initialization.
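
For intuition, here is a minimal NumPy sketch (my own illustration, not code from the repo) showing that, under these i.i.d. assumptions, every entry of the per-dimension variance vector matches the scalar variance of the flattened output at initialization:

    import numpy as np

    rng = np.random.default_rng(0)
    batch, d = 4096, 512

    # i.i.d. standard-normal input and Xavier-style i.i.d. weights, as at initialization.
    x = rng.normal(0.0, 1.0, size=(batch, d))
    w = rng.normal(0.0, np.sqrt(1.0 / d), size=(d, d))
    y = x @ w

    per_dim_var = y.var(axis=0)  # the D-dimensional variance vector
    scalar_var = y.var()         # scalar variance over the flattened output

    print(per_dim_var.min(), per_dim_var.max())  # every entry is close to 1.0
    print(scalar_var)                            # also close to 1.0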

You can find similar settings in "Understanding the Difficulty of Training Deep Feedforward Neural Networks" and "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification".

  2. At initialization, the outputs $f_j(x_{j-1})$ are independent of each other. Therefore, $\sum_{j<i} Var[f_j(\cdot)] = Var[\sum_{j<i-1} f_j(\cdot)] + Var[f_{i-1}(\cdot)]$. The output_std here is used to estimate $Var[f_{i-1}(\cdot)]$, and input_std to estimate $Var[\sum_{j<i-1} f_j(\cdot)]$ (input_std is the variance of residual * self.attention_ratio_change).
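
As a quick numeric check of this decomposition (again a sketch of my own, not repo code): for independent arrays, the variance of the sum equals the sum of the variances, which is exactly what the running encoder_ratio update accumulates:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000_000

    # Two independent branches: prev_sum stands in for sum_{j<i-1} f_j(.)
    # (i.e., residual * self.attention_ratio_change) and new_out for f_{i-1}(.) (i.e., x).
    prev_sum = rng.normal(0.0, 1.5, size=n)
    new_out = rng.normal(0.0, 0.8, size=n)

    input_std = np.var(prev_sum)    # ~ 2.25
    output_std = np.var(new_out)    # ~ 0.64

    print(np.var(prev_sum + new_out))        # ~ 2.89, matches the sum below
    print(input_std + output_std)            # ~ 2.89
    print(np.sqrt(input_std + output_std))   # the updated encoder_ratio, ~ 1.70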

I'm sorry that this part of the implementation is confusing. For the initialization calculation, it is mostly equivalent to the following version, since all residuals are normalized to $Var[\cdot] = 1$ by LayerNorm, except for the first layer (both versions work well):

global encoder_ratio, tmp_file
tmp_layer_ind = self.layer_num * 2 + 1
tmp_file.write('{} {}\n'.format(tmp_layer_ind, encoder_ratio))
self.attention_ratio_change.data.fill_(encoder_ratio)
output_std = np.var(x.clone().cpu().float().data.contiguous().view(-1).numpy())
encoder_ratio = np.sqrt(encoder_ratio**2 + output_std)

In my implementation, the reason I chose input_std is that it allows me to add some additional sanity checks and visualizations related to input_std (e.g., whether input_std + output_std ≈ Var[x + residual * self.attention_ratio_change]).
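
For completeness, here is a hedged sketch of that sanity check with hypothetical helper names (the actual hook lives inside the repo's encoder layer):

    import numpy as np
    import torch

    def flat_var(t):
        # Scalar variance over the flattened tensor, matching the repo's style.
        return np.var(t.clone().cpu().float().data.contiguous().view(-1).numpy())

    def variance_sanity_check(x, residual, ratio):
        # If the two branches are independent, their variances should add up.
        input_std = flat_var(residual * ratio)
        output_std = flat_var(x)
        combined = flat_var(x + residual * ratio)
        print('input + output: {:.4f}, combined: {:.4f}'.format(input_std + output_std, combined))

    # Hypothetical usage with random tensors standing in for layer activations:
    x = torch.randn(2048, 512)
    residual = torch.randn(2048, 512)
    variance_sanity_check(x, residual, 1.0)  # prints two values close to 2.0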