karpathy / llm.c

LLM training in simple, raw C/CUDA

Repository from Github: https://github.com/karpathy/llm.c

Question concerning `float4`

dongrixinyu opened this issue

I noticed that when llm.c adds the token embedding and the position embedding, it uses the float4 type.

// use of float4 leads to using 128-bit LDG / STG instructions in SASS,
// very helpful in memory-bound kernels like encoder_forward
__global__ void encoder_forward_kernel3(float4* out,
                               const int* inp, const float4* wte, const float4* wpe,
                               int B, int T, int C) {
    int C4 = C / 4;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int N = B * T * C4;
    if (idx < N) {
        int bt = idx / C4;
        int b = bt / T;
        int t = bt % T;
        int c4 = idx % C4;
        int ix = inp[b * T + t];
        out[b * T * C4 + t * C4 + c4] = add_float4(wte[ix * C4 + c4], wpe[t * C4 + c4]);
    }
}
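
(For reference, add_float4 used above is a small helper in llm.c; presumably it just adds the four lanes elementwise, along these lines:)

// elementwise add of two float4 values; everything stays in registers
__device__ inline float4 add_float4(const float4& a, const float4& b) {
    return make_float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
}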

as well as

    float vals[8][8] = {};
    if(bias != NULL) {
        for (int i = 0; i < 8; i++) {
            for (int j = 0; j < 8; j += 4) {
                float4 b = ld_vec(bias + oc + j);
                vals[i][j+0] = b.x;
                vals[i][j+1] = b.y;
                vals[i][j+2] = b.z;
                vals[i][j+3] = b.w;
            }
        }
    }
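
(Likewise, ld_vec is llm.c's vectorized-load helper; roughly, it reinterprets the pointer as a float4 so the compiler can issue one 128-bit load instead of four 32-bit ones. A sketch:)

// load 4 consecutive floats with a single 128-bit load
// (sketch; assumes 'address' is 16-byte aligned)
__device__ float4 ld_vec(const float* address) {
    return *reinterpret_cast<const float4*>(address);
}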

Is float4 faster than plain float? I'm curious because I checked that float4 is a struct containing 4 float elements, so I'd expect the pointer dereferences to cost a little extra time.
The computation of C4 also adds some extra arithmetic, by the way.

Vector types like float2 and float4 let the compiler emit wider 64-bit / 128-bit load and store instructions (LDG.64/STG.64, LDG.128/STG.128 in SASS), so the kernel moves the same number of bytes with far fewer memory instructions. That matters in memory-bound kernels like encoder_forward.
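
As a rough illustration (these copy kernels are hypothetical, not from llm.c), the float4 version below issues one 128-bit load and one 128-bit store per thread where the scalar version issues four of each:

// Illustrative only: scalar copy vs. vectorized copy of N floats.
__global__ void copy_scalar(float* out, const float* in, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        out[i] = in[i];              // one 32-bit LDG + one 32-bit STG
    }
}

__global__ void copy_vec4(float4* out, const float4* in, int N4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N4) {                    // N4 = N / 4, N assumed divisible by 4
        out[i] = in[i];              // one 128-bit LDG + one 128-bit STG
    }
}

Both versions read and write the same number of bytes, but the float4 one uses a quarter of the memory instructions, which helps the hardware keep more loads in flight. The float4 struct itself lives in registers, so there is no extra indirection, and the extra integer arithmetic for C4 is negligible next to the DRAM traffic. The requirements are that the pointers are 16-byte aligned and that C is divisible by 4.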