Samsung / ONE

On-device Neural Engine

[one-optimize] Optimize part of the transformer's attention-head

BalyshevArtem opened this issue · comments

What

Let's introduce two new optimization passes to simplify and accelerate part of the transformer's attention head.
Originally it has the following pattern that we can optimize:

Screenshot from 2024-04-24 17-53-02

1. First, we can fuse the

StridedSlice --- Concatenation
StridedSlice --- Neg /

pattern into a single Mul operation whose constant consists of 1s, with -1 where the Neg operation was.

As a result we will have:
Screenshot from 2024-04-24 17-57-50
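To make the equivalence concrete, here is a minimal numpy sketch (the shapes and the split point 40 are illustrative, and it assumes the concatenation keeps the slices in their original order; the reordered case is discussed later in this thread):

import numpy as np

x = np.random.randn(1, 80)                  # output of the preceding FullyConnected

# Original pattern: slice, negate the second half, concatenate
a = x[:, :40]                               # StridedSlice A (begin:0, end:40)
b = x[:, 40:]                               # StridedSlice B (begin:40, end:80)
pattern_out = np.concatenate([a, -b], axis=-1)

# Fused pattern: a single Mul with a constant of 1s and -1s
c = np.concatenate([np.ones(40), -np.ones(40)])
fused_out = x * c

assert np.allclose(pattern_out, fused_out)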

2. Then we can twice fuse a Mul with a FullyConnected node and get:

Screenshot from 2024-04-24 17-59-33
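A sketch of this Mul-into-FullyConnected fusion, assuming the FC computes x @ W.T + bias with W of shape [out, in] (as in TFLite/circle); the shapes are illustrative:

import numpy as np

x = np.random.randn(1, 240)
W = np.random.randn(80, 240)                # FC weights, shape [out, in]
bias = np.random.randn(80)
c = np.concatenate([np.ones(40), -np.ones(40)])   # constant from step 1

# Original: FullyConnected followed by Mul with a constant
original = (x @ W.T + bias) * c

# Fused: scale each output row of W (and the bias) by the constant
fused = x @ (W * c[:, None]).T + bias * c

assert np.allclose(original, fused)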

3. And finally, fusing the horizontal FC layers, we will get a single FC node:

Screenshot from 2024-04-24 18-01-03
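A sketch of the horizontal fusion, under the assumption that the two FullyConnected outputs are combined by an Add (so the weights and biases can simply be summed); shapes are again illustrative:

import numpy as np

x = np.random.randn(1, 240)
W1, W2 = np.random.randn(80, 240), np.random.randn(80, 240)
b1, b2 = np.random.randn(80), np.random.randn(80)

# Original: two FCs on the same input, outputs added
original = (x @ W1.T + b1) + (x @ W2.T + b2)

# Fused: a single FC with summed weights and biases
fused = x @ (W1 + W2).T + (b1 + b2)

assert np.allclose(original, fused)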

Why

To speed up and simplify attention-based models.

How

  • Introduce a pass to fuse the StridedSlice/Neg/Concatenation pattern into a Mul.
  • Introduce a pass to fuse a Mul with a FullyConnected node.

@BalyshevArtem, this is awesome!
I've resized the images a little bit smaller for better readability :)

Would you let me know which model you used?

In the model I used, only one FullyConnected layer was created in the corresponding part, so it seems that the structure varies slightly depending on the model.

Would you let me know which model you used?

I used a model generated in one of the internal repos: a modified Llama2 (split head).
It is the decoder part.

@BalyshevArtem Thanks for a good idea :) As @periannath mentioned, the original pattern seems to have duplicate FCs, i.e., the two FCs are in fact the same. So the baseline would be the pattern with a single FC layer.

For the second fusion, the second MUL is for applying rotary embedding, which would be a user input (not constant) if the model supports dynamic behavior.

If a model only supports fixed positions (all input tokens' position is fixed, which means that the number of previously cached tokens is also fixed), this would be an effective optimization.

Introduce a pass to fuse the StridedSlice/Neg/Concatenation pattern into a Mul.

This fusion looks good to me. One minor concern is that it will reduce the operator count but create a new constant tensor. Care must be taken not to increase the model size too much.

We can fuse this pattern only if we can then fuse the Mul with const into the FullyConnected operation located above it. As I understand it, whether the rotary embedding is dynamic or static does not affect this fusion, since there will always be a FullyConnected layer in front of this pattern, right?

the original pattern seems to have duplicate FCs, i.e., the two FCs are in fact the same. So the baseline would be the pattern with a single FC layer.

I'm not sure I got it right :) In the example that I used:

Screenshot from 2024-04-24 17-53-02

these two FC operations have different constants.
Or do you mean some other pattern?

whether the rotary embedding is dynamic or static does not affect this fusion, since there will always be a FullyConnected layer in front of this pattern, right?

Yes :)

these two FC operations have different constants.

Ah, your model seems to be the one whose attention heads are split. I thought about the pattern without head split. Below is the original pattern of rotary embedding whose heads are not split.

[image: rotary embedding pattern without head split]

After the heads are split, it seems that a new FC is created as the FC is fused with the Mul (the left Mul in the above graph).

I think that kind of fusion should be applied carefully (or suppressed), because it will increase the model size quite a lot (model size is a performance bottleneck as of now). Thanks for finding this.

@BalyshevArtem Could you share any preliminary result after this optimization, e.g., impacts on cycles/traffic? If there is some sensitive information, please use our internal repo.

Sure, I will post the results in the internal repo :)

Below is the original pattern of rotary embedding whose heads are not split.

In this example, we can also apply some optimizations:

  1. Fuse the StridedSlice-Neg-Concatenation pattern into a Mul operation whose constant consists of 1s, with -1 where the Neg operation was.
  2. Fuse the two Muls with constant values.
  3. Fuse the pattern Add(Mul(Input, Const1), Mul(Input, Const2)) into Mul(Input, Const3), where Const3 = Const1 + Const2 (see the check after this list).
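The third fusion is just distributivity; a minimal numpy check (shapes illustrative):

import numpy as np

x = np.random.randn(1, 80)
c1, c2 = np.random.randn(80), np.random.randn(80)

# Add(Mul(Input, Const1), Mul(Input, Const2)) == Mul(Input, Const1 + Const2)
assert np.allclose(x * c1 + x * c2, x * (c1 + c2))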

It seems that the first fusion is invalid. Please check the begin/end of StridedSlice.

StridedSlice A(begin:0, end:40)  --- Concatenation (B+A)
StridedSlice B(begin:40, end:80) --- Neg /

The order of the two sliced tensors is changed, so it is impossible to convert the pattern to a simple Mul.
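To make this concrete: with the reordered concatenation the pattern is the rotate-half operation used by rotary embedding, which moves elements across positions, so no elementwise constant can reproduce it. A small numpy check:

import numpy as np

x = np.arange(1.0, 81.0).reshape(1, 80)     # any nonzero values work here
a, b = x[:, :40], x[:, 40:]

# Reordered pattern: Concatenation(Neg(B), A), i.e. rotate-half
pattern_out = np.concatenate([-b, a], axis=-1)

# For an elementwise Mul, pattern_out / x would have to be a fixed constant,
# but it differs from element to element
would_be_c = pattern_out / x
assert not np.allclose(would_be_c, would_be_c[0, 0])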

Yes, you're right, thank you! Indeed, the tensor is split in half and the halves are swapped.

Such a pattern can still be optimized, but it gets more complicated. Let's expand the pattern in question by adding the FullyConnected:

  ---- Weight_Const
 |
FullyConnected ----> StridedSlice A(begin:0, end:40)  --- Concatenation (B+A)
              \----> StridedSlice B(begin:40, end:80) --- Neg /

So the idea is to first split and rotate the weights in the same way as the StridedSlices -> Concatenation does. In the example from #12917 (comment), we need to change the weights of the FullyConnected (with shape 80 x 240): split them into two parts by rows, a 40 x 240 first_part and a 40 x 240 second_part, and reverse their order, so that second_part comes first and first_part comes second. After that, introduce a Mul with negative values (for the first part), and then fuse it into the FC and so on (as in #12917 (comment)). A sketch of this weight transformation follows below.
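A minimal numpy sketch of the idea, under the same assumptions as above (FC computes x @ W.T + bias with W of shape 80 x 240, split point 40; here the Neg is folded directly into the rotated rows instead of going through an intermediate Mul):

import numpy as np

x = np.random.randn(1, 240)
W = np.random.randn(80, 240)                # FullyConnected weights, 80 x 240
bias = np.random.randn(80)

# Original pattern: FC, slice, Neg on B, Concatenation (B+A)
y = x @ W.T + bias
a, b = y[:, :40], y[:, 40:]
pattern_out = np.concatenate([-b, a], axis=-1)

# Optimized: reorder the weight (and bias) rows so second_part comes first,
# negating the rows that feed the Neg branch; the result is a single FC
first_part, second_part = W[:40], W[40:]
W_rot = np.concatenate([-second_part, first_part], axis=0)
bias_rot = np.concatenate([-bias[40:], bias[:40]])
fused_out = x @ W_rot.T + bias_rot

assert np.allclose(pattern_out, fused_out)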

It turns out to be a highly specialized optimization pattern, but at the same time it allows us to greatly reduce unnecessary computation and even reduce the binary size, thanks to fusing constants and weights.
@jinevening,
The question is: does this pattern occur in our target models? If you find it helpful, I will implement such an optimization, but if you think this pattern is too rare to be useful to us, then it is better to postpone this task. What do you think? :)

@BalyshevArtem I've answered the question in the internal repo.