Question concerning `float4`
dongrixinyu opened this issue · comments
I noticed that llm.c uses the `float4` type when adding the token embedding and the position embedding:
// use of float4 leads to using 128-bit LDG / STG instructions in SASS,
// very helpful in memory-bound kernels like encoder_forward
__global__ void encoder_forward_kernel3(float4* out,
                               const int* inp, const float4* wte, const float4* wpe,
                               int B, int T, int C) {
    int C4 = C / 4;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int N = B * T * C4;
    if (idx < N) {
        int bt = idx / C4;
        int b = bt / T;
        int t = bt % T;
        int c4 = idx % C4;
        int ix = inp[b * T + t];
        out[b * T * C4 + t * C4 + c4] = add_float4(wte[ix * C4 + c4], wpe[t * C4 + c4]);
    }
}
as well as
float vals[8][8] = {};
if(bias != NULL) {
    for (int i = 0; i < 8; i++) {
        for (int j = 0; j < 8; j += 4) {
            float4 b = ld_vec(bias + oc + j);
            vals[i][j+0] = b.x;
            vals[i][j+1] = b.y;
            vals[i][j+2] = b.z;
            vals[i][j+3] = b.w;
        }
    }
}
Is `float4` faster than `float`? I'm curious because, as far as I can tell, `float4` is a struct containing 4 `float` elements, and I would expect going through the struct to reach each element to cost a little extra time.
Also, working in units of `C4` still adds some extra index computation.
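Vector dtypes like `float2` and `float4` use faster load/store instructions: a thread moves 64 or 128 bits with a single load or store instead of issuing one 32-bit instruction per `float`, which is what helps in memory-bound kernels.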
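To make the comparison concrete, here is a minimal sketch of the same elementwise add written once with `float` and once with `float4`. The names `add_scalar`, `add_vec4` and `max_abs_diff_scalar_vs_vec4` are made up for illustration and are not from llm.c. The sketch assumes the element count is a multiple of 4 and that the buffers come from `cudaMalloc`, so they are aligned well enough to be reinterpreted as `float4`. The `float4` value is read into a local variable and its `.x`/`.y`/`.z`/`.w` members are then used directly, so there is no extra pointer chasing per element.

```cuda
#include <cassert>
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Scalar version: one thread per float, 32-bit loads and stores.
__global__ void add_scalar(float* out, const float* a, const float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}

// Vectorized version: one thread per group of four floats.
// Each float4 access can be compiled to a single 128-bit load or store.
__global__ void add_vec4(float4* out, const float4* a, const float4* b, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 x = a[i];
        float4 y = b[i];
        out[i] = make_float4(x.x + y.x, x.y + y.y, x.z + y.z, x.w + y.w);
    }
}

// Hypothetical helper: runs both kernels on the same inputs and returns the
// largest absolute difference between their outputs, or -1.0f on a CUDA error.
float max_abs_diff_scalar_vs_vec4(int n) {
    assert(n > 0 && n % 4 == 0);  // assumption: length divisible by 4
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    std::vector<float> ha(n), hb(n), h1(n), h2(n);
    for (int i = 0; i < n; i++) { ha[i] = 0.5f * i; hb[i] = 1.0f - i; }

    float *a = nullptr, *b = nullptr, *o1 = nullptr, *o2 = nullptr;
    if (cudaMalloc(&a, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&b, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&o1, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&o2, bytes) != cudaSuccess) return -1.0f;
    cudaMemcpy(a, ha.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b, hb.data(), bytes, cudaMemcpyHostToDevice);

    int block = 256;
    int n4 = n / 4;  // the vectorized kernel needs 4x fewer threads
    add_scalar<<<(n + block - 1) / block, block>>>(o1, a, b, n);
    add_vec4<<<(n4 + block - 1) / block, block>>>(
        reinterpret_cast<float4*>(o2),
        reinterpret_cast<const float4*>(a),
        reinterpret_cast<const float4*>(b), n4);
    if (cudaGetLastError() != cudaSuccess) return -1.0f;

    // Blocking copies on the default stream wait for both kernels to finish.
    cudaMemcpy(h1.data(), o1, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h2.data(), o2, bytes, cudaMemcpyDeviceToHost);

    float max_diff = 0.0f;
    for (int i = 0; i < n; i++) {
        float d = std::fabs(h1[i] - h2[i]);
        if (d > max_diff) max_diff = d;
    }
    cudaFree(a); cudaFree(b); cudaFree(o1); cudaFree(o2);
    return max_diff;
}

int main() {
    float diff = max_abs_diff_scalar_vs_vec4(1 << 20);
    printf("max abs difference between scalar and float4 kernels: %g\n", diff);
    return diff == 0.0f ? 0 : 1;
}
```

On the `C4` point: the division by 4 happens once per thread, and the vectorized kernel launches a quarter as many threads, so the index arithmetic per output element does not go up. To see which instructions were actually generated, one option is to dump the SASS of the compiled binary with `cuobjdump -sass` and look for the 128-bit variants of the load and store instructions in `add_vec4`. Any speedup should be measured on the target GPU rather than assumed.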