karpathy / llama2.c

Inference Llama 2 in one file of pure C

causal attention implementation

liecn opened this issue

After examining the code, I find it a bit unclear how the causal attention is implemented. Specifically, I'm seeking clarification on how the mask in the following Python code can be translated into a C implementation.

[screenshots of the Python attention code showing the mask being added to the scores]

Thanks!

@liecn it's the line:

scores = scores + self.mask...

That does the work.

Suppose your mask was:

0,-inf,-inf,-inf
0,   0,-inf,-inf
0,   0,   0,-inf
0,   0,   0,   0

Adding 0 to a score does nothing; adding -inf makes the score -inf, which masks it: softmax exponentiates the scores, and exp(-inf) = 0, so masked positions get exactly zero attention weight.
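To make that concrete, here is a small, self-contained C toy (not code from this repo) showing that a score with -inf added ends up with zero weight after softmax:

```c
#include <math.h>
#include <stdio.h>

/* numerically stable softmax over x[0..n-1] */
void softmax(float* x, int n) {
    float max_val = x[0];
    for (int i = 1; i < n; i++) if (x[i] > max_val) max_val = x[i];
    float sum = 0.0f;
    for (int i = 0; i < n; i++) { x[i] = expf(x[i] - max_val); sum += x[i]; }
    for (int i = 0; i < n; i++) x[i] /= sum;
}

int main(void) {
    /* row 2 of the mask above has been added to some made-up raw scores:
       positions 0 and 1 are visible, positions 2 and 3 got -inf */
    float scores[4] = {1.0f, 2.0f, -INFINITY, -INFINITY};
    softmax(scores, 4);
    for (int i = 0; i < 4; i++) printf("%.3f ", scores[i]);
    printf("\n"); /* prints: 0.269 0.731 0.000 0.000 */
    return 0;
}
```

(Compile with `cc toy.c -lm`.) The masked positions contribute exp(-inf) = 0 to the softmax sum, so all the attention weight goes to the visible positions.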

Since you don't need the causal mask during inference, are you doing a C implementation of the training?
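(For anyone following along, the reason no mask is needed at inference time: run.c generates one token at a time and only ever scores the current query against keys already in the cache, so the loop bound itself enforces causality. A simplified sketch of that idea; the identifiers are illustrative, not the verbatim run.c source:)

```c
#include <math.h>

/* score one attention head's query against the cached keys of tokens
   0..pos. Because the loop never looks past `pos`, future positions are
   simply absent -- the same effect as the -inf mask, with no mask tensor. */
void attention_scores(const float* q, const float* key_cache, float* att,
                      int pos, int head_size) {
    for (int t = 0; t <= pos; t++) {                 /* past + current tokens only */
        const float* k = key_cache + t * head_size;  /* cached key of token t */
        float score = 0.0f;
        for (int i = 0; i < head_size; i++) score += q[i] * k[i];
        att[t] = score / sqrtf((float)head_size);
    }
    /* a softmax over att[0..pos] and a weighted sum of the cached values
       would follow, exactly as in the training math, minus the mask. */
}
```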

Ohh, I see. Thanks, Giles!

Does this imply that I can eliminate the causal attention (i.e., the mask) by simply removing that part from the Python training code, without requiring any changes to the C inference implementation?

@liecn When training, you provide the complete sequence to the self-attention heads. However, it's crucial to mask out scores that represent affinities between past tokens and future tokens. This ensures that each token only computes affinities with the tokens at or before its own position.
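If you did want the training-time behavior in C, the mask amounts to overwriting the strict upper triangle of the seq_len x seq_len score matrix with -inf before the softmax. A hedged sketch mirroring what the Python `scores = scores + self.mask` line achieves (the function name is made up for illustration):

```c
#include <math.h>

/* set scores[i][j] to -inf wherever key position j is in the future of
   query position i, i.e. the strict upper triangle of the score matrix */
void apply_causal_mask(float* scores, int seq_len) {
    for (int i = 0; i < seq_len; i++)           /* query (row) position */
        for (int j = i + 1; j < seq_len; j++)   /* future key positions */
            scores[i * seq_len + j] = -INFINITY;
}
```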

If you don't implement a causal mask, the model won't be trained to be autoregressive (i.e., to predict the next token from only the tokens before it). Instead, it will function as a bag-of-words model, lacking the ability to capture sequential relationships.