ofirpress / attention_with_linear_biases

Code for the ALiBi method for transformer language models (ICLR 2022)


Modifying ALiBi for Encoder-Attention or Cross-Attention

ofirpress opened this issue

In our paper we only showed results on causal language models, which use causally masked (decoder) self-attention.

If you'd like to use ALiBi for seq2seq models such as translation, speech, or T5-style models, or if you'd like to use ALiBi for masked language models such as BERT, some modifications are required.

Encoder-Attention

Encoder-Attention is the non-masked self-attention that is performed in the encoder of seq2seq models such as translation models or T5. This is also the same kind of attention used in MLM models such as BERT.

We can't naively copy-paste the ALiBi code into these models because it won't work. We use a trick to quickly calculate the bias matrix for causal language modeling, but that bias matrix is only correct for values on or below the main diagonal (since that's all that's used in causal language modeling).

            # maxpos is the maximum sequence length, attn_heads is the number of attention heads
            maxpos = args.tokens_per_sample
            attn_heads = args.encoder_attention_heads

            # Absolute distance |i - j| between every query position i and key position j,
            # expanded to one copy per head: shape (attn_heads, maxpos, maxpos)
            context_position = torch.arange(maxpos)[:, None].cuda()
            memory_position = torch.arange(maxpos)[None, :].cuda()
            relative_position = memory_position - context_position
            relative_position = torch.abs(relative_position).unsqueeze(0).expand(attn_heads, -1, -1)

This code correctly generates the full distance matrix (multiplying it by the per-head slopes then gives the bias matrix). Note that the matrix is symmetric around the diagonal, since it computes the absolute distance between the query and key positions (so all distances are non-negative).
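For intuition, here is a quick standalone check (a minimal sketch, using a hypothetical toy maxpos of 4) of what that distance matrix looks like:

import torch

maxpos = 4  # toy value for illustration
context_position = torch.arange(maxpos)[:, None]
memory_position = torch.arange(maxpos)[None, :]
relative_position = torch.abs(memory_position - context_position)
print(relative_position)
# tensor([[0, 1, 2, 3],
#         [1, 0, 1, 2],
#         [2, 1, 0, 1],
#         [3, 2, 1, 0]])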

We're also going to need the code for generating the ALiBi slopes:

            def get_slopes(n):
                def get_slopes_power_of_2(n):
                    start = (2**(-2**-(math.log2(n)-3)))
                    ratio = start
                    return [start*ratio**i for i in range(n)]

                if math.log2(n).is_integer():
                    return get_slopes_power_of_2(n)                   #In the paper, we only train models that have 2^a heads for some a. This function has
                else:                                                 #some good properties that only occur when the input is a power of 2. To maintain that even
                    closest_power_of_2 = 2**math.floor(math.log2(n))  #when the number of heads is not a power of 2, we use this workaround. 
                    return get_slopes_power_of_2(closest_power_of_2) + get_slopes(2*closest_power_of_2)[0::2][:n-closest_power_of_2]
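As a quick usage check (assuming math is imported): for a power-of-2 number of heads the slopes form a clean geometric sequence, while for other head counts the workaround above interleaves slopes taken from the next power of 2, so the resulting list is not monotonically decreasing.

print(get_slopes(8))
# [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]
print(get_slopes(12))
# the 8 values above, followed by 4 interleaved values from get_slopes(16): ~0.7071, ~0.3536, ~0.1768, ~0.0884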

There are 3 options for implementing encoder-attention ALiBi:

  1. Symmetric: In this option, the bias we assign to query/key pairs that are +N or -N tokens apart will be the same.
                self.slopes = torch.Tensor(get_slopes(attn_heads)).cuda()*-1
                self.alibi = self.slopes.unsqueeze(1).unsqueeze(1) * relative_position
                self.alibi = self.alibi.view(1, attn_heads, maxpos, maxpos)

Now just pass self.alibi to the attention function and add it after the query*key computation.

In fairseq, for example, the query*key computation is done like this: attn_weights = torch.bmm(q, k.transpose(1, 2)), and then to add the ALiBi values use:

attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
attn_weights += alibi[:,:,:tgt_len,:src_len].to(attn_weights)
attn_weights = attn_weights.view(bsz*self.num_heads, tgt_len, src_len)
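If you're not using fairseq, a minimal standalone sketch of the same idea (hypothetical shapes, not the authors' code): compute the attention scores, add the sliced ALiBi tensor, then apply softmax as usual.

import torch
import torch.nn.functional as F

def encoder_attention_with_alibi(q, k, v, alibi):
    # q, k, v: (bsz, heads, seq_len, head_dim); alibi: (1, heads, maxpos, maxpos)
    scores = torch.matmul(q, k.transpose(-2, -1)) / q.size(-1) ** 0.5
    tgt_len, src_len = scores.shape[-2], scores.shape[-1]
    scores = scores + alibi[:, :, :tgt_len, :src_len]
    return torch.matmul(F.softmax(scores, dim=-1), v)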
  2. Nonsymmetric: Here we are going to make the model nonsymmetric by using the same ALiBi bias as in (1), but this time, we're going to let the first half of the heads only look left and the second half only look right. We'll do this by adding a mask to our attention.
    Note: This code hasn't been fully tested yet and might contain bugs.
                self._future_mask_right = torch.triu(utils.fill_with_neg_inf(torch.zeros([maxpos, maxpos])), 1).unsqueeze(0).repeat(attn_heads//2, 1, 1)
                self._future_mask_left = torch.tril(utils.fill_with_neg_inf(torch.zeros([maxpos, maxpos])), -1).unsqueeze(0).repeat(attn_heads//2, 1, 1)
                
                self.nonsym_mask = torch.cat((self._future_mask_right, self._future_mask_left), dim = 0).unsqueeze(0).cuda()
                self.slopes = torch.Tensor(get_slopes(attn_heads//2)).cuda()*-1
                
                context_position = torch.arange(maxpos)[:, None].cuda()
                memory_position = torch.arange(maxpos)[None, :].cuda()
                relative_position = memory_position - context_position
                relative_position = torch.abs(relative_position).unsqueeze(0).expand(attn_heads//2, -1,-1)

                self.alibi = self.slopes.unsqueeze(1).unsqueeze(1) * relative_position
                self.alibi = self.alibi.view(1, attn_heads//2, maxpos, maxpos)
                self.alibi = self.alibi.repeat(1, 2, 1, 1).cuda()

Again, as before, add self.alibi to the attn_weights, but this time also add the nonsym_mask tensor (in fairseq: attn_weights += nonsym_mask[:,:,:tgt_len,:src_len].to(attn_weights)). A combined sketch is shown below.
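In fairseq-style code, the two additions together might look something like this (a sketch reusing the names from the option 1 snippet):

attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
attn_weights += alibi[:, :, :tgt_len, :src_len].to(attn_weights)
attn_weights += nonsym_mask[:, :, :tgt_len, :src_len].to(attn_weights)
attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)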

  3. Nonsymmetric with no mask: In this approach, we don't use any masking, but instead we make the positioning non-symmetric by using different ALiBi slopes depending on whether the key is to the left or right of the query. Here, we use learned slopes, but you can also do this with non-learned slopes.
    Note: I haven't tested this code so it might contain bugs!
# Learned slope parameters, one per head, initialized from N(-2, 1)
slopes_left = torch.nn.parameter.Parameter(torch.Tensor(attn_heads))
nn.init.normal_(slopes_left, -2, 1)
slopes_right = torch.nn.parameter.Parameter(torch.Tensor(attn_heads))
nn.init.normal_(slopes_right, -2, 1)

# -sigmoid squashes each slope into the range (-1, 0)
slopes_left = -torch.sigmoid(slopes_left)
slopes_right = -torch.sigmoid(slopes_right)

context_position = torch.arange(maxpos)[:, None]
memory_position = torch.arange(maxpos)[None, :]
relative_position = memory_position - context_position
relative_position = torch.abs(relative_position).unsqueeze(0).expand(attn_heads, -1, -1)

alibi_left = slopes_left.unsqueeze(1).unsqueeze(1) * relative_position
alibi_right = slopes_right.unsqueeze(1).unsqueeze(1) * relative_position

# Keys to the right of the query (upper triangle) get the "right" slopes,
# keys to the left (lower triangle) get the "left" slopes
self.alibi = torch.triu(alibi_right) + torch.tril(alibi_left)
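One caveat with learned slopes: if this bias is computed only once at construction time, it won't track updates to the slope parameters during training. A rough, untested sketch of computing it inside forward() instead (assuming relative_position is stored as a buffer and slopes_left / slopes_right stay raw parameters; these names are illustrative):

def learned_alibi(self, tgt_len, src_len):
    rel = self.relative_position[:, :tgt_len, :src_len]  # (heads, tgt_len, src_len)
    left = -torch.sigmoid(self.slopes_left).unsqueeze(1).unsqueeze(1) * rel
    right = -torch.sigmoid(self.slopes_right).unsqueeze(1).unsqueeze(1) * rel
    return torch.triu(right) + torch.tril(left)  # the diagonal of rel is 0, so no double-counting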
  4. Check out the variation on option 3 from the LittleBird paper.

Cross-Attention

For translation models and models like T5 you will also need to implement cross-attention, which is the attention from the decoder to the encoder. The T5 model uses no positional information in cross-attention and I would recommend doing the same thing.
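A minimal sketch of what that looks like in practice (hypothetical shapes, plain PyTorch): cross-attention is just scaled dot-product attention with no bias and no position embeddings added.

import torch
import torch.nn.functional as F

def cross_attention(dec_q, enc_k, enc_v):
    # dec_q: (bsz, heads, tgt_len, head_dim); enc_k, enc_v: (bsz, heads, src_len, head_dim)
    # No ALiBi and no positional information here, following the T5 recipe
    scores = torch.matmul(dec_q, enc_k.transpose(-2, -1)) / dec_q.size(-1) ** 0.5
    return torch.matmul(F.softmax(scores, dim=-1), enc_v)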

Implementations

NEW: lucidrains has implemented some of the above ideas in the x-transformers repo, see lucidrains/x-transformers#88.

Hello @ofirpress,

I have 2 questions regarding moving from positional embeddings to ALiBi.

Assuming we are using positional embeddings with a max of 512 tokens, the transformer can forward any sequence length up to 512 tokens without any modification.

When using ALiBi, do you create the bias matrix "on the fly" on each forward pass to support different sequence lengths, or do you create the bias matrix once and just pass it to the transformer? The second option restricts the transformer to forwarding only sequences of length = maxpos.

Maybe define ALiBi once with maxpos, and when adding ALiBi do:

seq_len = attention_scores.size()[-1]
attention_scores += self.alibi[:, :, :seq_len, :seq_len]

The second question is:
Can we utilize pretrained LMs like BERT to use ALiBi, or do we need to train a new LM from scratch? More specifically, for an LM like BERT trained with positional embeddings, can we just refactor the code to move to ALiBi and then finetune from the pretrained model, or do we need to train an LM with ALiBi from scratch?

Thanks in advance

or do you create the bias matrix once and just pass it to the transformer

Right now that's what we do: we create the ALiBi tensor once and add it to the mask. If the current sequence length is shorter than maxpos, the mask is just cut down to the right dimensions.

If you have sequences of different lengths, there's always going to be a maxpos, which is the maximum length you expect your model to ever see; just make ALiBi that size and cut it down when the sequences are shorter.

Can we utilize pretrained LMs like BERT to use ALiBi, or do we need to train a new LM from scratch?

I haven't experimented with this too much, but I think that if you have a model that was trained with sinusoidal or learned embeddings, you would have to retrain it from scratch if you want to use ALiBi. It would be interesting to experiment with just finetuning with ALiBi; I have no idea whether that would work. If you do end up trying the finetuning method, tell me how it goes, I'm curious!

just make ALiBi that size and cut it down when the sequences are shorter.

Like this?

seq_len = attention_scores.size()[-1]
attention_scores += self.alibi[:, :, :seq_len, :seq_len]

If you do end up trying the finetuning method, tell me how it goes, I'm curious!

I will !

do you know about pretrained models with ALiBi?

Yup the code you posted looks good.
We've released some trained models on WikiText-103 here.

Hi @ofirpress,

I just saw this issue and the paper you co-authored, Transformer Language Models without Positional Encodings Still Learn Positional Information. So, which of the 3 options did you go with for the MLM ALiBi experiment (Table 4)? 😉

I think it was option 2 but I just emailed Peter (who ran those experiments) and I'll tell you when he gets back to me.

I think it was option 2 but I just emailed Peter (who ran those experiments) and I'll tell you when he gets back to me.

Thanks!

I haven't experimented with this too much, but I think that if you have a model that was trained with sinusoidal or learned embeddings, you would have to retrain it from scratch if you want to use ALiBi. It would be interesting to experiment with just finetuning with ALiBi; I have no idea whether that would work. If you do end up trying the finetuning method, tell me how it goes, I'm curious!

FYI, changing the position embeddings of BERT and then finetuning seems to work in On Position Embeddings in BERT, so it may be doable for ALiBi. They didn't quite get a better model in the end, though, and it's unclear whether training from scratch would work better.

Hi @EIFY and @ofirpress , I implemented and tested the first option (symmetric) and pre-trained BERT from scratch.

Thanks @peteriz! I've also heard from others that option 2 works well, so I would try both and see what leads to better results.

@EIFY - I am not aware of any results showing that it is possible to apply ALiBi to a model that wasn't trained with it. I think it's a super interesting question and I'm curious to see if it could be made possible.

I am experimenting with the following asymmetrical implementation, which uses different offsets for the linear bias forward & backward:
https://github.com/EIFY/fairseq/blob/fc1fabc8612cd25cf3e15a5623ebddd59f1219bd/fairseq/utils.py#L968-L973
My rationale is that if the model can discern a difference of 1 * slopes in linear biases, it can probably discern a difference of 0.5 * slopes. Obviously, experiments will have the final say though 😅

Hi @ofirpress ,

I'm trying to use ALiBi for machine translation.
It works well if I only apply ALiBi in the decoder (BLEU is about 55). But the performance gets worse if I apply ALiBi in the encoder and cross-attention (BLEU is about 10). I've tried the above 3 options for the encoder.
Do you have any suggestions?

Thank you in advance!

Hi @lyc2001:
Try applying ALiBi in the encoder with variation 1 or 2. Use decoder alibi as normal. Don't apply any kind of bias or position embedding to the cross-attention. That might work. Tell me how it goes.

Also: make sure you are not using any kind of positional embeddings at all when you use ALiBi.

Hi @ofirpress again,

I have tested a couple of variations of ALiBi w/ MLM, using RoBERTa as the baseline (https://github.com/EIFY/fairseq). I think you may be interested in the results 🙂

Great. I'm wondering if you also tried any of the options listed above?

Symmetrical ALiBi (option 1 above) behaves almost identically to shifted asymmetrical ALiBi in my WikiText-103 experiments.

(If you really want to know I can pick out the dotted line corresponding to it 😂)

@ofirpress Sorry, but correct me if I am wrong: positional encoding is needed just in self-attention, and we do not need it in cross-attention? I am referring to the T5 cross-attention implementation.

@Arij-Aladel yes, in the original post Ofir says

The T5 model uses no positional information in cross-attention and I would recommend doing the same thing.

An extra data point: I've found that masking half the heads (option 2, asymmetric) worked well for my use case.

@Daniel-Liu-c0deb0t What kind of model are you building and what metrics are you optimizing?

People have asked me how to implement ALiBi for FIM (fill-in-the-middle) models. Here are two ideas I have:

[Screenshot: a FIM-formatted token sequence (tokens in black) with the proposed ALiBi biases below it (in blue)]

  1. Proposal one is to use regular ALiBi, but instead of constantly increasing the ALiBi bias, it should reset every time we enter a new chunk. See the picture (tokens in black, ALiBi biases in blue on the second row). Since the prefix comes before the main text, the numbers there are descending. This is slightly suboptimal since this biasing kind of implies that the "mid" section is of length 0, but I don't think it's horrible.
  2. The second idea is to use the same biasing as before, but this time, have 1/3 of the heads look just at the prefix, 1/3 look just at the suffix, and 1/3 of the heads look just at the mid (see the rough sketch after this list). I think this would work well, based on similar ideas working for encoder-only (bidirectional) attention.
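Here's a rough, untested sketch of how the head restriction in idea 2 could be built. This is entirely hypothetical: it assumes you know each token's chunk id in the FIM-formatted sequence and that the number of heads is divisible by 3, and the usual ALiBi bias (and causal mask, if any) would still be added on top.

import torch

def fim_head_masks(chunk_ids, attn_heads):
    # chunk_ids: (seq_len,) tensor with 0 = prefix, 1 = suffix, 2 = mid.
    # Returns (attn_heads, seq_len, seq_len): each third of the heads may only
    # attend to key positions belonging to one chunk.
    seq_len = chunk_ids.size(0)
    masks = []
    for chunk in range(3):
        allowed = (chunk_ids == chunk)[None, :].expand(seq_len, -1)
        mask = torch.zeros(seq_len, seq_len).masked_fill(~allowed, float("-inf"))
        masks.append(mask.unsqueeze(0).expand(attn_heads // 3, -1, -1))
    return torch.cat(masks, dim=0)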

@EIFY well my use case is pretty specialized (DNA error correction) but it's a BERT-like model. In my experiments, it's trained from scratch with ALiBi and it is very slightly better than sinusoidal absolute position encoding.

Hi @lyc2001: Try applying ALiBi in the encoder with variation 1 or 2. Use decoder alibi as normal. Don't apply any kind of bias or position embedding to the cross-attention. That might work. Tell me how it goes.

Also: make sure you are not using any kind of positional embeddings at all when you use ALiBi.

Hi @ofirpress ,

Thank you! My machine translation model works well now, but I'm facing another problem. When testing on data of about the same length as the training data, BLEU is 55, about the same as the vanilla Transformer. My training data has about 10 characters per sentence on average, and I'm trying to extrapolate to data with about 50 characters per sentence. BLEU drops to 8. The outputs are indeed much longer than those generated by the vanilla Transformer, but they keep repeating some words and sometimes can't end the sentence.

For example,

我 可以 帶 時間 好好 休息 好 休息 你們 要 聽 媽媽 媽媽 媽媽 媽媽 媽媽 媽媽 媽媽 媽媽 媽媽 媽媽 媽媽 媽媽 媽媽 去 煮飯
以後 你 一定 做 錯 是 我 怎麼 救 你 來不及 你 來不及 你 想 救 我 救 我 救 我 救 我 救 我 想 救 我 救 我 救 我 救 我 救 我 救 我 救 我 救 我 救 我 救 我 救 我 救 我 救 我 救 我 救 我

Do you have any suggestions?
Thank you in advance!

Extrapolation has its limits; it seems like in your use case, training on 10 characters might just be too little...

Also, to make sure there's no bug, test what happens if you train on 10 and try extrapolating to length 11 or 12 at inference. There you should see the same or slightly better BLEU.

Hi @ofirpress ,
Thank you for your reply! I'll try to apply it on shorter length sentences and see what happens.

Hi there!

I'm hoping to use ALiBi for T5 and was wondering if anyone could share their code. I've been having trouble finding the exact place to add the relative_position.

For a head count of 12 I'm seeing a jump in the plotted slope values, e.g. the 0.707 shown below.

Is this expected behaviour?

[0.5,
 0.25,
 0.125,
 0.0625,
 0.03125,
 0.015625,
 0.0078125,
 0.00390625,
 0.7071067811865476,
 0.35355339059327384,
 0.17677669529663692,
 0.08838834764831849]
# ALiBi
#

import math
import torch

b_cache = {}

def alibi_bias(n, device):
    # Causal (decoder-style) distance matrix: row i holds -(i - j) for key positions
    # j < i, and 0 elsewhere. Cached per sequence length n.
    if str(n) in b_cache:
        bias = b_cache[str(n)]
    else:
        bias = torch.zeros(n, n)
        for i in range(n):
            bias[i, :i] = -torch.arange(i, 0, -1)
        b_cache[str(n)] = bias
    bias = bias.to(device)
    return bias

def get_slopes(n):
    # ALiBi: When using ALiBi, we do not add position embeddings at any point in the network. The only
    # modification we apply is after the query-key dot product, where we add a static, non-learned bias:
    # softmax(q(i)K.T + m @ [-(i - 1), ..., -2, -1, 0]),
    # where scalar `m` is a head-specific slope fixed before training.
    def get_slopes_power_of_2(n):
        start = (2**(-2**-(math.log2(n)-3)))
        ratio = start
        return [start*ratio**i for i in range(n)]

    if math.log2(n).is_integer():
        return get_slopes_power_of_2(n)                   # In the paper, we only train models that have 2^a heads for some a. This function has
    else:                                                 # some good properties that only occur when the input is a power of 2. To maintain that even
        closest_power_of_2 = 2**math.floor(math.log2(n))  # when the number of heads is not a power of 2, we use this workaround.
        
        return get_slopes_power_of_2(closest_power_of_2) + get_slopes(2*closest_power_of_2)[0::2][:n-closest_power_of_2]

#
# #####
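For reference, a quick check of what the alibi_bias helper above produces (before multiplying by the per-head slopes):

print(alibi_bias(4, "cpu"))
# tensor([[ 0.,  0.,  0.,  0.],
#         [-1.,  0.,  0.,  0.],
#         [-2., -1.,  0.,  0.],
#         [-3., -2., -1.,  0.]])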

Has anyone tried combining ALiBi with CrossBatch from the Focused Transformer paper?