karpathy / llm.c

LLM training in simple, raw C/CUDA

Repository from Github: https://github.com/karpathy/llm.c

bt-invariant inference

karpathy opened this issue

Currently we only ever call the gpt2_forward function with a single, fixed setting of B,T for both training and inference, e.g.:

gpt2_forward(&model, gen_tokens, NULL, B, T);

However, in principle, once we forward with B,T and allocate space for B,T tokens, we should be able to forward any shape with 1 <= b <= B and 1 <= t <= T. This would be very useful for speeding up inference. We always train on chunks of B,T tokens, but at inference time we could forward only tiny blocks, e.g. as small as B=1, T=1 on the first time step, then B=1, T=2 on the second time step, and so on.
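For intuition, here is a minimal sketch of what the generation loop could look like once the forward pass is bt-invariant; it mirrors the spirit of the generation loop in train_gpt2.c, but sample_next_token is a hypothetical helper and the exact buffer names are assumptions:

    // Sketch, not the repo's generation loop: grow the sequence one token at a
    // time and forward only the b=1, t=(current length) prefix. The activation
    // buffers were allocated once for the full B,T, so any smaller shape fits.
    gen_tokens[0] = GPT2_EOT;                  // start generation from the end-of-text token
    for (int t = 1; t < gen_max_length; t++) {
        // forward only 1 row and t columns instead of the full B,T block
        gpt2_forward(&model, gen_tokens, NULL, 1, t);
        // sample the next token from the logits at position t-1
        // (sample_next_token is a hypothetical helper, details elided)
        gen_tokens[t] = sample_next_token(&model, t - 1);
    }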

We'd still be recomputing all tokens 1 <= t <= T from scratch (i.e. we have no kv-cache for inference), but at least it would be a LOT cheaper than right now, where every forward pass processes the full B*T tokens.

Once this is implemented, relax the if statement here:

        if (B != model->batch_size || T != model->seq_len) {
            printf("Model: B=%d T=%d, Desired: B=%d T=%d\n", model->batch_size, model->seq_len, B, T);
            exit(EXIT_FAILURE);
        }
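Relaxed, the check should only reject shapes that don't fit inside the buffers allocated at first use; roughly (a sketch, assuming the same struct fields):

        // relaxed: accept any B,T that fits inside the allocated activation buffers
        if (B > model->batch_size || T > model->seq_len) {
            printf("Model: B=%d T=%d, Desired: B=%d T=%d\n", model->batch_size, model->seq_len, B, T);
            exit(EXIT_FAILURE);
        }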

And include tests inside gpt2_test.c / gpt2_test.cu that check this "bt-invariance" correctness:

  • forward a full B,T batch of tokens
  • forward a bunch of smaller shapes with 1 <= b <= B and 1 <= t <= T, and ensure that the outputs exactly match the corresponding outputs of the full pass up to b,t (see the sketch after this list)
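One possible shape for such a test (illustrative only: the field names model.acts.logits and model.config.vocab_size follow train_gpt2.c, the snippet assumes the usual <assert.h>, <stdlib.h>, <string.h> includes, and it assumes the smaller forward packs its logits contiguously as (b,t,V)):

    // bt-invariance test sketch, not the repo's test code
    int V = model.config.vocab_size;
    gpt2_forward(&model, tokens, NULL, B, T);
    // stash the logits of the full B,T forward pass
    float* full_logits = (float*)malloc((size_t)B * T * V * sizeof(float));
    memcpy(full_logits, model.acts.logits, (size_t)B * T * V * sizeof(float));
    // forward a few smaller shapes and compare the overlapping outputs
    int* sub_tokens = (int*)malloc((size_t)B * T * sizeof(int));
    for (int b = 1; b <= B; b *= 2) {
        for (int t = 1; t <= T; t *= 2) {
            // repack the first b rows and t columns into a contiguous (b,t) grid
            for (int i = 0; i < b; i++) {
                memcpy(sub_tokens + i * t, tokens + i * T, t * sizeof(int));
            }
            gpt2_forward(&model, sub_tokens, NULL, b, t);
            for (int i = 0; i < b; i++) {
                for (int j = 0; j < t; j++) {
                    float* got  = model.acts.logits + ((size_t)i * t + j) * V;
                    float* want = full_logits       + ((size_t)i * T + j) * V;
                    // exact agreement up to b,t, as described above
                    for (int v = 0; v < V; v++) { assert(got[v] == want[v]); }
                }
            }
        }
    }
    free(sub_tokens);
    free(full_logits);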

Most of the layers in GPT-2 are parallelized across both batch and time, so they don't care about the shape and should be fine. The only problematic layer that needs adjustment is the attention layer, where you have to index carefully and distinguish between the size of the array that stores the activations (allocated for the full B,T) and the extent of the activations we are actually asked to compute in this forward pass (b,t).
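For concreteness, here is one way the attention score loop could make that distinction explicit, with Tmax as the sequence length the buffers were allocated for and T as the extent requested this pass; the tensor layouts and names are illustrative assumptions, not the repo's attention_forward:

    #include <math.h>
    #include <stddef.h>

    // Illustrative sketch: the att buffer is laid out as (B, NH, Tmax, Tmax) because it
    // was allocated for the maximum sequence length, but the loops only cover the
    // T <= Tmax positions requested in this pass. q and k are laid out as (B, Tmax, NH, hs).
    void attention_scores_sketch(float* att, const float* q, const float* k,
                                 int B, int T, int Tmax, int NH, int hs) {
        float scale = 1.0f / sqrtf((float)hs);
        for (int b = 0; b < B; b++) {
            for (int h = 0; h < NH; h++) {
                for (int qt = 0; qt < T; qt++) {       // compute extent: T
                    // indexing uses the storage stride Tmax, not the extent T
                    float* att_row = att + ((size_t)(b * NH + h) * Tmax + qt) * Tmax;
                    const float* qv = q + ((size_t)(b * Tmax + qt) * NH + h) * hs;
                    for (int kt = 0; kt <= qt; kt++) { // causal mask: keys 0..qt only
                        const float* kv = k + ((size_t)(b * Tmax + kt) * NH + h) * hs;
                        float dot = 0.0f;
                        for (int i = 0; i < hs; i++) { dot += qv[i] * kv[i]; }
                        att_row[kt] = dot * scale;
                    }
                }
            }
        }
    }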