karpathy / llama2.c

Inference Llama 2 in one file of pure C


Prefill Processing

Nick-infinity opened this issue · comments

Hello, I am sorry if this question is very basic, but I need a little help over here.

Can't we just skip the attention processing and continue from here for the input prompt tokens? The KV cache would still be built for all the input tokens.
Why do we need to compute the attention for all the input tokens as well, all the way down to the logits?
I can't seem to find any connection between the input tokens and the generated tokens apart from the KV cache.

llama2.c/run.c, line 280 in d986206

@karpathy

The x activation is fed back through the subsequent layers after being updated with the attention output, and I guess that's why attention is needed for the input tokens as well.
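To make that dependency concrete, here is a small self-contained toy sketch (not the actual run.c code; the names, dimensions, and the "attention" stub are all made up for illustration). Each layer writes the current token's key/value into its cache from x, and x at that point already carries the attention outputs of the layers below it, so skipping attention for the prompt tokens would leave every layer above the first with wrong cache entries.

```c
// Toy illustration only: NOT llama2.c's forward(). Shows why each layer's
// KV-cache entries depend on the attention output of the layer below.
#include <stdio.h>
#include <string.h>

#define DIM 4
#define N_LAYERS 2
#define MAX_SEQ 8

static float key_cache[N_LAYERS][MAX_SEQ][DIM];
static float value_cache[N_LAYERS][MAX_SEQ][DIM];

// stand-in for a per-layer attention block: reads the cache up to `pos`
// and writes an output into `out` (here just an average of cached values;
// the real thing also scores against the cached keys)
static void toy_attention(float v_cache[MAX_SEQ][DIM], int pos, float* out) {
    memset(out, 0, sizeof(float) * DIM);
    for (int t = 0; t <= pos; t++)
        for (int i = 0; i < DIM; i++)
            out[i] += v_cache[t][i] / (float)(pos + 1);
}

// one forward pass for the token at position `pos`; activation in/out in `x`
static void forward(float* x, int pos) {
    float att_out[DIM];
    for (int l = 0; l < N_LAYERS; l++) {
        // the k/v written into layer l's cache are computed FROM x,
        // which already includes the attention output of layer l-1 ...
        memcpy(key_cache[l][pos], x, sizeof(float) * DIM);
        memcpy(value_cache[l][pos], x, sizeof(float) * DIM);
        toy_attention(value_cache[l], pos, att_out);
        // ... because of this residual update. Skipping attention for
        // prompt tokens would therefore corrupt the cache of every layer
        // above the first one.
        for (int i = 0; i < DIM; i++) x[i] += att_out[i];
    }
}

int main(void) {
    float x[DIM];
    for (int pos = 0; pos < 3; pos++) {
        for (int i = 0; i < DIM; i++) x[i] = (float)(pos + 1); // fake embedding
        forward(x, pos);
        printf("pos %d: x[0] after forward = %f\n", pos, x[0]);
    }
    return 0;
}
```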

We can definitely avoid

```c
rmsnorm(x, x, w->rms_final_weight, dim);

// classifier into logits
matmul(s->logits, x, w->wcls, p->dim, p->vocab_size);
```

for the input tokens. That would give some perf bump.
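A hedged sketch of how that gating could look from the caller's side (stub functions only, not the real run.c; the `compute_logits` flag and the stub bodies are hypothetical). During prefill the next token is forced from the prompt anyway, so the logits are only needed from the last prompt position onward, and the final rmsnorm plus the wcls matmul can be skipped before that.

```c
// Self-contained sketch with stubs, not the real run.c. Illustrates where
// the classifier can be skipped: prompt positions (except the last one)
// never have their logits sampled.
#include <stdio.h>

#define VOCAB 8

static float logits[VOCAB];

// stub forward pass: pretend all transformer layers run here (that part can
// never be skipped, since it fills the KV cache), then optionally the classifier
static float* forward(int token, int pos, int compute_logits) {
    /* ... all layers: attention + FFN, KV cache updated ... */
    if (!compute_logits) return NULL;  // skip final rmsnorm + wcls matmul
    for (int i = 0; i < VOCAB; i++)    // fake classifier output for the demo
        logits[i] = (float)((token + i + pos) % VOCAB);
    return logits;
}

// stub greedy sampler
static int sample(const float* l) {
    int best = 0;
    for (int i = 1; i < VOCAB; i++)
        if (l[i] > l[best]) best = i;
    return best;
}

int main(void) {
    int prompt_tokens[] = {1, 5, 3};
    int num_prompt_tokens = 3, steps = 6;
    int token = prompt_tokens[0], pos = 0;
    while (pos < steps) {
        int in_prefill = (pos < num_prompt_tokens - 1);
        // logits are only computed once we actually need to sample from them
        float* l = forward(token, pos, /*compute_logits=*/!in_prefill);
        int next = in_prefill ? prompt_tokens[pos + 1] : sample(l);
        printf("pos %d: token %d -> next %d\n", pos, token, next);
        token = next;
        pos++;
    }
    return 0;
}
```

In the real code this would amount to threading such a flag (or the prompt length) into forward() and returning before the final rmsnorm and classifier matmul when the logits will not be used.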