karpathy / llama2.c

Inference Llama 2 in one file of pure C

causal attention implementation

liecn opened this issue

After examining the code, I find it a bit unclear how the causal attention is implemented. Specifically, I'm seeking clarification on how the mask in the following Python code can be translated into a C implementation.

[screenshots of the Python attention code showing the mask being added to the scores]

Thanks!

@liecn it's the line:

scores = scores + self.mask...

That does the work.

Suppose your mask was:

0,-inf,-inf,-inf
0,   0,-inf,-inf
0,   0,   0,-inf
0,   0,   0,   0

Adding 0 to a score does nothing; adding -inf makes the score -inf, which masks it: softmax exponentiates the scores, and exp(-inf) = 0, so masked positions get exactly zero attention weight.
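To make that concrete, here is a small, self-contained C toy (not code from this repo) showing that a score with -inf added ends up with zero weight after softmax:

```c
#include <math.h>
#include <stdio.h>

/* numerically stable softmax over x[0..n-1] */
void softmax(float* x, int n) {
    float max_val = x[0];
    for (int i = 1; i < n; i++) if (x[i] > max_val) max_val = x[i];
    float sum = 0.0f;
    for (int i = 0; i < n; i++) { x[i] = expf(x[i] - max_val); sum += x[i]; }
    for (int i = 0; i < n; i++) x[i] /= sum;
}

int main(void) {
    /* row 2 of the mask above has been added to some made-up raw scores:
       positions 0 and 1 are visible, positions 2 and 3 got -inf */
    float scores[4] = {1.0f, 2.0f, -INFINITY, -INFINITY};
    softmax(scores, 4);
    for (int i = 0; i < 4; i++) printf("%.3f ", scores[i]);
    printf("\n"); /* prints: 0.269 0.731 0.000 0.000 */
    return 0;
}
```

(Compile with `cc toy.c -lm`.) The masked positions contribute exp(-inf) = 0 to the softmax sum, so all the attention weight goes to the visible positions.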

Since you don't need the causal mask during inference, are you doing a C implementation of the training?
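(For anyone following along, the reason no mask is needed at inference time: run.c generates one token at a time and only ever scores the current query against keys already in the cache, so the loop bound itself enforces causality. A simplified sketch of that idea; the identifiers are illustrative, not the verbatim run.c source:)

```c
#include <math.h>

/* score one attention head's query against the cached keys of tokens
   0..pos. Because the loop never looks past `pos`, future positions are
   simply absent -- the same effect as the -inf mask, with no mask tensor. */
void attention_scores(const float* q, const float* key_cache, float* att,
                      int pos, int head_size) {
    for (int t = 0; t <= pos; t++) {                 /* past + current tokens only */
        const float* k = key_cache + t * head_size;  /* cached key of token t */
        float score = 0.0f;
        for (int i = 0; i < head_size; i++) score += q[i] * k[i];
        att[t] = score / sqrtf((float)head_size);
    }
    /* a softmax over att[0..pos] and a weighted sum of the cached values
       would follow, exactly as in the training math, minus the mask. */
}
```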

Ohh, I see. Thanks, Giles!

Does this imply that I can eliminate the causal attention (i.e., the mask) by simply removing that part from the Python training code, without requiring any changes to the C inference implementation?

@liecn When training, you provide the complete sequence to the self-attention heads. However, it's crucial to mask out scores that represent affinities between past tokens and future tokens. This ensures that each token only computes affinities with the tokens at or before its own position.
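If you did want the training-time behavior in C, the mask amounts to overwriting the strict upper triangle of the seq_len x seq_len score matrix with -inf before the softmax. A hedged sketch mirroring what the Python `scores = scores + self.mask` line achieves (the function name is made up for illustration):

```c
#include <math.h>

/* set scores[i][j] to -inf wherever key position j is in the future of
   query position i, i.e. the strict upper triangle of the score matrix */
void apply_causal_mask(float* scores, int seq_len) {
    for (int i = 0; i < seq_len; i++)           /* query (row) position */
        for (int j = i + 1; j < seq_len; j++)   /* future key positions */
            scores[i * seq_len + j] = -INFINITY;
}
```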

If you don't implement a causal mask, the model won't be trained to be autoregressive (i.e., to predict the next token from only the tokens before it). Instead, it will function as a bag-of-words model, lacking the ability to capture sequential relationships.