cmp-nct / ggllm.cpp

Falcon LLM ggml framework with CPU and GPU support

Slowdown with tokens

cmp-nct opened this issue · comments

commented

With each token processed, the inference speed slows down a little bit; it starts to become noticeable at around 50 tokens on 40B Q3_K and adds up.

See ggerganov/ggml#231 (comment) for a hint about how to fix it.

According to Mr. GG you'd need to do something similar to ggerganov#775

commented

It sounds like @jploski is already somewhat on track; I'll keep my fingers crossed in that regard.
JohannesGaessler also made some remarks on the use of repeat, suggesting it be avoided in some cases.

Is it related to the issue described in this comment: https://news.ycombinator.com/item?id=36377873 ?

"The primary issue is that the current upstream version (on Huggingface) hasn't implemented key-value caching correctly. KV caching is needed to bring the complexity down from O(n^3) to O(n^2). The issues are: (1) their implementation uses Torch' scaled dot-product attention, which uses incorrect causal masks when the query/key sizes are not the same (which it the case when generating with a cache). (2) They don't index the rotary embeddings correctly when using key-value cache, so the rotary embedding of the first token is used for all generated tokens. Together, this causes the model to output garbage and it only works when using it without KV caching, making it very slow."

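For illustration, a minimal sketch of the mask problem described in that quote (hypothetical helper in C++, not the HuggingFace or ggml code): with a KV cache the query block only covers the newest n_new tokens while the keys cover all n_past + n_new positions, so the causal mask has to be offset by n_past.

#include <vector>

// Returns allow[i][j] == true if query i (a new token) may attend to key j.
std::vector<std::vector<bool>> causal_mask(int n_past, int n_new) {
    std::vector<std::vector<bool>> allow(n_new, std::vector<bool>(n_past + n_new));
    for (int i = 0; i < n_new; ++i)                // query index over new tokens only
        for (int j = 0; j < n_past + n_new; ++j)   // key index over cached + new tokens
            allow[i][j] = (j <= n_past + i);       // an unshifted j <= i mask would be wrong here
    return allow;
}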
commented

With my latest commit, performance is almost doubled at higher context windows.
I hope I didn't add any bugs, but a quick perplexity test showed no changes.
I changed the permutation of V before repeat2 to avoid large memory copy operations on the tensor.
K was easier, but basically the same thing.

This is still not the end goal, but it's a good step ahead.
The next step is to remove repeat2 completely, which should be a relatively minor "hack" on the ggml CPU side but a relatively major change for a "clean" solution.

We need the matrix multiplication to access the same K and V tensors without repeating them.
So we either split it up into one multiplication per kv_head, if that's possible, or change mul_mat so it uses a pointer to the data instead.
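As an illustration of the "one multiplication per kv_head" idea, here is a minimal CPU sketch (plain float buffers; the head ordering, with query heads grouped consecutively per kv head, is an assumption; this is not the ggml implementation):

// Q: [n_head][d] queries for one token, K: [n_head_kv][n_past][d] cached keys,
// S: [n_head][n_past] attention scores. K is read in place, never repeated.
void qk_scores(const float * Q, const float * K, float * S,
               int n_head, int n_head_kv, int n_past, int d) {
    const int group = n_head / n_head_kv;          // query heads sharing one kv head
    for (int kvh = 0; kvh < n_head_kv; ++kvh) {
        const float * k = K + kvh * n_past * d;    // shared K block for this group
        for (int g = 0; g < group; ++g) {
            const int h = kvh * group + g;         // query head served by this kv head
            for (int t = 0; t < n_past; ++t) {
                float acc = 0.0f;
                for (int i = 0; i < d; ++i) acc += Q[h * d + i] * k[t * d + i];
                S[h * n_past + t] = acc;
            }
        }
    }
}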

I believe this terrible slowdown is caused by the incorrect utilization of the KV cache. As it stands with every new token all the previous tokens' query and key vectors are calculated and multiplied by each other, leading to quadratically increasing computation time. But I believe it would be sufficient to just process only the last token (or more generally, last N, taking prompt processing and batched token generation into consideration) and retrieve all the previous ones from cache.
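For reference, a self-contained toy of what correct KV-cache usage means here (toy dimensions, not the ggllm.cpp code): per generated token, K/V are computed once for that token only and appended; attention then reads the whole cache, so each step costs O(n_past) rather than recomputing everything.

#include <vector>

struct Entry { std::vector<float> k, v; };          // one cached K/V entry per past token

int main() {
    const int d = 4;                                 // toy head dimension
    std::vector<Entry> cache;
    for (int pos = 0; pos < 8; ++pos) {              // generation loop
        std::vector<float> q(d, 0.3f), k(d, 0.1f), v(d, 0.2f);  // new token only
        cache.push_back({k, v});                     // append; old entries are never recomputed
        std::vector<float> scores(cache.size());
        for (size_t t = 0; t < cache.size(); ++t) {  // attend over all cached keys
            float s = 0.0f;
            for (int i = 0; i < d; ++i) s += q[i] * cache[t].k[i];
            scores[t] = s;                           // softmax and the weighted V sum would follow
        }
    }
}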

Is it related to the issue described in this comment: https://news.ycombinator.com/item?id=36377873 ?

Yes, I think it's topically related (and the problem already being present in the upstream Python version would explain why I also got it wrong in the translation, apart from my ignorance). But I suspect there are also some tangential concerns in that thread related to HuggingFace's KV cache implementation that are not actually relevant to fixing it in GGML.

commented

I believe this terrible slowdown is caused by the incorrect utilization of the KV cache. As it stands with every new token all the previous tokens' query and key vectors are calculated and multiplied by each other, leading to quadratically increasing computation time. But I believe it would be sufficient to just process only the last token (or more generally, last N, taking prompt processing and batched token generation into consideration) and retrieve all the previous ones from cache.

Part of the slowdown was fixed yesterday; it's twice as fast as before now.
Here is the current state of the slowdown:
This is a 300-token generation on 7B using "--debug-timings 2" to get one snapshot at the beginning and one at the end of the generation.
First token, last layer, and total sum of operations:

 - 2479: [  8192,     1,   1]x[    47,    47,  47]=[  8192,     1,   1]             NORM   (  1) cpu =   0.000 /   0.000 ms, wall =   0.009 /   0.009 ms [ 60 node_2479]   CPU
 - 2480: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              MUL   (  1) cpu =   0.000 /   0.000 ms, wall =   0.006 /   0.006 ms [ 60 node_2480]   CPU
 - 2481: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.005 /   0.005 ms [ 60 node_2481]   CPU
 - 2482: [  8192,  9216,   1]x[  8192,     1,   1]=[  9216,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.122 /   0.122 ms [ 60 node_2482]   GPU  (Slow)
 - 2483: [  9216,     1,   1]x[    47,    47,  47]=[    64,     8,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 60 Kcur]   CPU
 - 2484: [    64,     8,   1]x[     3,     1,   1]=[    64,     8,   1]             ROPE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.003 ms [ 60 node_2484]   CPU
 - 2485: [15728640,     1,   1]x[    47,    47,  47]=[   512,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 k]   GPU
 - 2486: [    64,     8,   1]x[   512,     1,   1]=[   512,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.003 ms [ 60 node_2486]   CPU
 - 2487: [  9216,     1,   1]x[    47,    47,  47]=[    64,     8,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 60 Vcur]   CPU
 - 2488: [15728640,     1,   1]x[    47,    47,  47]=[   512,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 60 v]   GPU
 - 2489: [    64,     8,   1]x[   512,     1,   1]=[   512,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.017 /   0.017 ms [ 60 node_2489]   CPU
 - 2490: [  8192,     1,   1]x[    47,    47,  47]=[  8192,     1,   1]             NORM   (  1) cpu =   0.000 /   0.000 ms, wall =   0.009 /   0.009 ms [ 60 node_2490]   CPU
 - 2491: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              MUL   (  1) cpu =   0.000 /   0.000 ms, wall =   0.005 /   0.005 ms [ 60 node_2491]   CPU
 - 2492: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.009 /   0.009 ms [ 60 inpFF]   CPU
 - 2493: [  8192, 32768,   1]x[  8192,     1,   1]=[ 32768,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.252 /   0.252 ms [ 60 inpFF*ff_up]   GPU  (Slow)
 - 2494: [ 32768,     1,   1]x[    47,    47,  47]=[ 32768,     1,   1]             GELU   (  1) cpu =   0.000 /   0.000 ms, wall =   0.029 /   0.029 ms [ 60 node_2494]   CPU
 - 2495: [ 32768,  8192,   1]x[ 32768,     1,   1]=[  8192,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.285 /   0.285 ms [ 60 gelu_cur*ff_down]   GPU  (Slow)
 - 2496: [15728640,     1,   1]x[    47,    47,  47]=[  1024,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 node_2496]   GPU
 - 2497: [  1024,     1,   1]x[    47,    47,  47]=[    64,     8,   2]          RESHAPE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 60 node_2497]   CPU
 - 2498: [    64,     8,   2]x[    47,    47,  47]=[    64,     2,   8]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 60 node_2498]   CPU
 - 2499: [    64,     2,   8]x[    47,    47,  47]=[     2,    64,   8]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 60 node_2499]   CPU
 - 2500: [     2,    64,   8]x[    47,    47,  47]=[     2,    64,   8]             CONT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 node_2500]   CPU
 - 2501: [     2,    64,   8]x[     2,    64, 128]=[     2,    64, 128]          REPEAT2   (  1) cpu =   0.000 /   0.000 ms, wall =   0.018 /   0.018 ms [ 60 V]   CPU
 - 2502: [15728640,     1,   1]x[    47,    47,  47]=[  1024,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 60 node_2502]   GPU
 - 2503: [  1024,     1,   1]x[    47,    47,  47]=[    64,     8,   2]          RESHAPE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 60 node_2503]   CPU
 - 2504: [    64,     8,   2]x[    47,    47,  47]=[    64,     2,   8]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 60 node_2504]   CPU
 - 2505: [    64,     2,   8]x[    47,    47,  47]=[    64,     2,   8]             CONT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 node_2505]   CPU
 - 2506: [    64,     2,   8]x[    64,     2, 128]=[    64,     2, 128]          REPEAT2   (  1) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.003 ms [ 60 K]   CPU
 - 2507: [  9216,     1,   1]x[    47,    47,  47]=[    64,   128,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 60 Qcur]   CPU
 - 2508: [    64,   128,   1]x[     3,     1,   1]=[    64,   128,   1]             ROPE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.007 /   0.007 ms [ 60 node_2508]   CPU
 - 2509: [    64,   128,   1]x[    47,    47,  47]=[    64,     1, 128]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 60 Q]   CPU
 - 2510: [    64,     2, 128]x[    64,     1, 128]=[     2,     1, 128]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.005 /   0.005 ms [ 60 KQ]   CPU
 - 2511: [     2,     1, 128]x[     1,     1,   1]=[     2,     1, 128]            SCALE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.004 ms [ 60 KQ_scaled]   CPU
 - 2512: [     2,     1, 128]x[     2,     1,   1]=[     2,     1, 128]    DIAG_MASK_INF   (  1) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.003 ms [ 60 KQ_masked]   CPU
 - 2513: [     2,     1, 128]x[    47,    47,  47]=[     2,     1, 128]         SOFT_MAX   (  1) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.004 ms [ 60 KQ_soft_max]   CPU
 - 2514: [     2,    64, 128]x[     2,     1, 128]=[    64,     1, 128]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.010 /   0.010 ms [ 60 KQV]   CPU
 - 2515: [    64,     1, 128]x[    47,    47,  47]=[    64,   128,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 60 KQV_merged]   CPU
 - 2516: [    64,   128,   1]x[  8192,     1,   1]=[  8192,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.004 ms [ 60 node_2516]   CPU
 - 2517: [  8192,  8192,   1]x[  8192,     1,   1]=[  8192,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.114 /   0.114 ms [ 60 result_wo]   GPU  (Slow)
 - 2518: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.004 ms [ 60 attn_out]   CPU
 - 2519: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.005 /   0.005 ms [ 60 node_2519]   CPU
 - 2520: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.004 ms [ 60 inpFF_+_result_attn_out]   CPU
 - 2521: [  8192,     1,   1]x[    47,    47,  47]=[  8192,     1,   1]             NORM   (  1) cpu =   0.000 /   0.000 ms, wall =   0.019 /   0.019 ms [  0 norm_cur]   CPU
 - 2522: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              MUL   (  1) cpu =   0.000 /   0.000 ms, wall =   0.005 /   0.005 ms [  0 node_2522]   CPU
 - 2523: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.007 /   0.007 ms [  0 result_norm]   CPU
 - 2524: [  8192, 65024,   1]x[  8192,     1,   1]=[ 65024,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.429 /   0.429 ms [  0 result_lm_head]   GPU  (Slow)
perf_total_per_op_us[             ADD] =   1.900 ms
perf_total_per_op_us[             MUL] =   0.645 ms
perf_total_per_op_us[            GELU] =   1.463 ms
perf_total_per_op_us[            NORM] =   1.841 ms
perf_total_per_op_us[         MUL_MAT] =  48.567 ms
perf_total_per_op_us[           SCALE] =   0.233 ms
perf_total_per_op_us[             CPY] =   1.357 ms
perf_total_per_op_us[            CONT] =   0.122 ms
perf_total_per_op_us[         RESHAPE] =   0.120 ms
perf_total_per_op_us[            VIEW] =   0.420 ms
perf_total_per_op_us[         PERMUTE] =   0.300 ms
perf_total_per_op_us[        GET_ROWS] =   0.007 ms
perf_total_per_op_us[   DIAG_MASK_INF] =   0.225 ms
perf_total_per_op_us[        SOFT_MAX] =   0.240 ms
perf_total_per_op_us[            ROPE] =   1.007 ms
perf_total_per_op_us[         REPEAT2] =   1.261 ms

Last token, last layer and total sum of operations:

 - 2479: [  8192,     1,   1]x[    47,    47,  47]=[  8192,     1,   1]             NORM   (  1) cpu =   0.000 /   0.000 ms, wall =   0.011 /   0.011 ms [ 60 node_2479]   CPU
 - 2480: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              MUL   (  1) cpu =   0.000 /   0.000 ms, wall =   0.005 /   0.005 ms [ 60 node_2480]   CPU
 - 2481: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.005 /   0.005 ms [ 60 node_2481]   CPU
 - 2482: [  8192,  9216,   1]x[  8192,     1,   1]=[  9216,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.194 /   0.194 ms [ 60 node_2482]   GPU
 - 2483: [  9216,     1,   1]x[    47,    47,  47]=[    64,     8,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 60 Kcur]   CPU
 - 2484: [    64,     8,   1]x[     3,     1,   1]=[    64,     8,   1]             ROPE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.006 /   0.006 ms [ 60 node_2484]   CPU
 - 2485: [15728640,     1,   1]x[    47,    47,  47]=[   512,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 60 k]   GPU
 - 2486: [    64,     8,   1]x[   512,     1,   1]=[   512,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.005 /   0.005 ms [ 60 node_2486]   CPU
 - 2487: [  9216,     1,   1]x[    47,    47,  47]=[    64,     8,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 60 Vcur]   CPU
 - 2488: [15728640,     1,   1]x[    47,    47,  47]=[   512,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 60 v]   GPU
 - 2489: [    64,     8,   1]x[   512,     1,   1]=[   512,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.005 /   0.005 ms [ 60 node_2489]   CPU
 - 2490: [  8192,     1,   1]x[    47,    47,  47]=[  8192,     1,   1]             NORM   (  1) cpu =   0.000 /   0.000 ms, wall =   0.011 /   0.011 ms [ 60 node_2490]   CPU
 - 2491: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              MUL   (  1) cpu =   0.000 /   0.000 ms, wall =   0.007 /   0.007 ms [ 60 node_2491]   CPU
 - 2492: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.007 /   0.007 ms [ 60 inpFF]   CPU
 - 2493: [  8192, 32768,   1]x[  8192,     1,   1]=[ 32768,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.434 /   0.434 ms [ 60 inpFF*ff_up]   GPU  (Slow)
 - 2494: [ 32768,     1,   1]x[    47,    47,  47]=[ 32768,     1,   1]             GELU   (  1) cpu =   0.000 /   0.000 ms, wall =   0.022 /   0.022 ms [ 60 node_2494]   CPU
 - 2495: [ 32768,  8192,   1]x[ 32768,     1,   1]=[  8192,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.500 /   0.500 ms [ 60 gelu_cur*ff_down]   GPU  (Slow)
 - 2496: [15728640,     1,   1]x[    47,    47,  47]=[155648,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 60 node_2496]   GPU
 - 2497: [155648,     1,   1]x[    47,    47,  47]=[    64,     8, 304]          RESHAPE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 node_2497]   CPU
 - 2498: [    64,     8, 304]x[    47,    47,  47]=[    64,   304,   8]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 60 node_2498]   CPU
 - 2499: [    64,   304,   8]x[    47,    47,  47]=[   304,    64,   8]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 60 node_2499]   CPU
 - 2500: [   304,    64,   8]x[    47,    47,  47]=[   304,    64,   8]             CONT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.095 /   0.095 ms [ 60 node_2500]   CPU
 - 2501: [   304,    64,   8]x[   304,    64, 128]=[   304,    64, 128]          REPEAT2   (  1) cpu =   0.000 /   0.000 ms, wall =   0.442 /   0.442 ms [ 60 V]   CPU  (Slow)
 - 2502: [15728640,     1,   1]x[    47,    47,  47]=[155648,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 60 node_2502]   GPU
 - 2503: [155648,     1,   1]x[    47,    47,  47]=[    64,     8, 304]          RESHAPE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 60 node_2503]   CPU
 - 2504: [    64,     8, 304]x[    47,    47,  47]=[    64,   304,   8]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 60 node_2504]   CPU
 - 2505: [    64,   304,   8]x[    47,    47,  47]=[    64,   304,   8]             CONT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.035 /   0.035 ms [ 60 node_2505]   CPU
 - 2506: [    64,   304,   8]x[    64,   304, 128]=[    64,   304, 128]          REPEAT2   (  1) cpu =   1.000 /   1.000 ms, wall =   0.502 /   0.502 ms [ 60 K]   CPU  (Slow)
 - 2507: [  9216,     1,   1]x[    47,    47,  47]=[    64,   128,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 Qcur]   CPU
 - 2508: [    64,   128,   1]x[     3,     1,   1]=[    64,   128,   1]             ROPE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.009 /   0.009 ms [ 60 node_2508]   CPU
 - 2509: [    64,   128,   1]x[    47,    47,  47]=[    64,     1, 128]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 Q]   CPU
 - 2510: [    64,   304, 128]x[    64,     1, 128]=[   304,     1, 128]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.107 /   0.107 ms [ 60 KQ]   CPU
 - 2511: [   304,     1, 128]x[     1,     1,   1]=[   304,     1, 128]            SCALE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.005 /   0.005 ms [ 60 KQ_scaled]   CPU
 - 2512: [   304,     1, 128]x[     2,     1,   1]=[   304,     1, 128]    DIAG_MASK_INF   (  1) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.004 ms [ 60 KQ_masked]   CPU
 - 2513: [   304,     1, 128]x[    47,    47,  47]=[   304,     1, 128]         SOFT_MAX   (  1) cpu =   0.000 /   0.000 ms, wall =   0.020 /   0.020 ms [ 60 KQ_soft_max]   CPU
 - 2514: [   304,    64, 128]x[   304,     1, 128]=[    64,     1, 128]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.068 /   0.068 ms [ 60 KQV]   CPU
 - 2515: [    64,     1, 128]x[    47,    47,  47]=[    64,   128,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 60 KQV_merged]   CPU
 - 2516: [    64,   128,   1]x[  8192,     1,   1]=[  8192,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.004 ms [ 60 node_2516]   CPU
 - 2517: [  8192,  8192,   1]x[  8192,     1,   1]=[  8192,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.190 /   0.190 ms [ 60 result_wo]   GPU
 - 2518: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.004 ms [ 60 attn_out]   CPU
 - 2519: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.020 /   0.020 ms [ 60 node_2519]   CPU
 - 2520: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.005 /   0.005 ms [ 60 inpFF_+_result_attn_out]   CPU
 - 2521: [  8192,     1,   1]x[    47,    47,  47]=[  8192,     1,   1]             NORM   (  1) cpu =   0.000 /   0.000 ms, wall =   0.010 /   0.010 ms [  0 norm_cur]   CPU
 - 2522: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              MUL   (  1) cpu =   0.000 /   0.000 ms, wall =   0.008 /   0.008 ms [  0 node_2522]   CPU
 - 2523: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.006 /   0.006 ms [  0 result_norm]   CPU
 - 2524: [  8192, 65024,   1]x[  8192,     1,   1]=[ 65024,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.753 /   0.753 ms [  0 result_lm_head]   GPU  (Slow)
perf_total_per_op_us[             ADD] =   2.134 ms
perf_total_per_op_us[             MUL] =   0.922 ms
perf_total_per_op_us[            GELU] =   1.705 ms
perf_total_per_op_us[            NORM] =   1.739 ms
perf_total_per_op_us[         MUL_MAT] =  91.762 ms
perf_total_per_op_us[           SCALE] =   0.310 ms
perf_total_per_op_us[             CPY] =   1.584 ms
perf_total_per_op_us[            CONT] =   9.310 ms
perf_total_per_op_us[         RESHAPE] =   0.120 ms
perf_total_per_op_us[            VIEW] =   0.420 ms
perf_total_per_op_us[         PERMUTE] =   0.300 ms
perf_total_per_op_us[        GET_ROWS] =   0.005 ms
perf_total_per_op_us[   DIAG_MASK_INF] =   0.267 ms
perf_total_per_op_us[        SOFT_MAX] =   1.476 ms
perf_total_per_op_us[            ROPE] =   1.150 ms
perf_total_per_op_us[         REPEAT2] =  63.665 ms

And for comparison, this is without the patch:

perf_total_per_op_us[             ADD] =   2.443 ms
perf_total_per_op_us[             MUL] =   1.030 ms
perf_total_per_op_us[            GELU] =   2.038 ms
perf_total_per_op_us[            NORM] =   1.893 ms
perf_total_per_op_us[         MUL_MAT] = 165.571 ms
perf_total_per_op_us[           SCALE] =   0.404 ms
perf_total_per_op_us[             CPY] =   1.889 ms
perf_total_per_op_us[            CONT] = 131.453 ms
perf_total_per_op_us[         RESHAPE] =   0.121 ms
perf_total_per_op_us[            VIEW] =   0.420 ms
perf_total_per_op_us[         PERMUTE] =   0.244 ms
perf_total_per_op_us[       TRANSPOSE] =   0.060 ms
perf_total_per_op_us[        GET_ROWS] =   0.005 ms
perf_total_per_op_us[   DIAG_MASK_INF] =   0.330 ms
perf_total_per_op_us[        SOFT_MAX] =   1.624 ms
perf_total_per_op_us[            ROPE] =   1.284 ms
perf_total_per_op_us[         REPEAT2] =  91.311 ms

The largest slowdown came from MUL_MAT, followed by CONT (131 ms for token 300), then REPEAT2.

CONT is a minor problem now, but I also looked into the caching of K and V; I bet we can cache them more efficiently.
In the end, nothing will keep us from getting rid of the K and V repeater; that needs to go.
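To put numbers on the CONT/REPEAT2 cost, here is a small self-contained calculation using the shapes from the traces above (illustrative arithmetic only): the PERMUTE is free because it only rewrites strides, but the CONT copies the whole cached V slice every token, and REPEAT2 copies it again, expanded to all 128 query heads.

#include <cstdio>

int main() {
    const long head_dim = 64, n_head = 128, n_head_kv = 8;
    for (long n_past : {2L, 304L}) {                          // kv length at the first vs. the ~300th token
        long cont_elems    = head_dim * n_head_kv * n_past;   // elements copied by CONT
        long repeat2_elems = head_dim * n_head    * n_past;   // elements copied by REPEAT2
        std::printf("n_past=%ld: CONT copies %ld floats, REPEAT2 copies %ld floats\n",
                    n_past, cont_elems, repeat2_elems);
    }
}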

commented

@jploski took the time to figure out how to make the broadcasting PR work with the interleaving repeat-type multiplication.
It boils down to a single line of code that probably took an evening of crunching tensors?
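A guess at what that single line boils down to (hedged; not the actual patch): broadcasting K over the query heads needs a head-index mapping, and the two natural candidates differ in whether the kv heads are tiled (ABAB...) or each serves a consecutive block of query heads (AABB...). Which one is correct depends on how Falcon's fused QKV weights order the heads, and picking the wrong one is exactly the kind of off-by-a-line bug described here.

// Hypothetical helpers, for illustration only.
inline int src_head_tiled  (int q_head, int n_head, int n_head_kv) { (void) n_head; return q_head % n_head_kv; }
inline int src_head_grouped(int q_head, int n_head, int n_head_kv) { return q_head / (n_head / n_head_kv); }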

That's a 1077 token generation on Falcon 40B q6_k:

falcon_print_timings:        load time = 15373.79 ms
falcon_print_timings:      sample time =   236.77 ms /  1024 runs   (    0.23 ms per token,  4324.82 tokens per second)
falcon_print_timings:        eval time = 117927.21 ms /  1077 runs   (  109.50 ms per token,     9.13 tokens per second)
falcon_print_timings:       total time = 118337.51 ms

The branch is not completely ready. It's still a mess with ggml differences and probably bugged in 10 places, but I need 8 hours of sleep.

Really good job Jan :)

I believe this terrible slowdown is caused by the incorrect utilization of the KV cache. As it stands with every new token all the previous tokens' query and key vectors are calculated and multiplied by each other, leading to quadratically increasing computation time. But I believe it would be sufficient to just process only the last token (or more generally, last N, taking prompt processing and batched token generation into consideration) and retrieve all the previous ones from cache.

Scratch that - it's already working as it should: the query vectors are computed only for the new token(s), and they are multiplied by the key vectors of all the past tokens retrieved from cache. So there is no redundant computation going on.

I got myself confused by comparing the KQ product shapes for prompt lengths n and n+1, but it is to be expected that the KQ product shape increases quadratically with the initial prompt length, and it is a one-time operation.

It remains to check if the slowdown is simply because of the increasing context size (as stated above, linearly more key vectors to multiply by with each new token), or if there is still some other issue, but at this moment I see none.

I just committed a small fix (to my GGML falcon40b branch and to ggllm.cpp master), which eliminates an unnecessary 3d reshape op, but it did not substantially improve performance.

commented

Okay, I think the next real big step is to move more into CUDA.
Just checked your optimization; I'm not seeing a performance boost, but it is certainly cleaner that way.

I've spent a good 10 hours since yesterday; what looked like a quick integration turned into a multi-layered nightmare.
I had a good 5 hours of garbage output that occasionally appeared. Now I know why: Wizard fine-tuning changes the shape of two tensors by 1. I did not notice (or expect) that, and it broke the cuBLAS multiplications.
And when it worked, the existing offload to cuBLAS wasted up to 4 GB of VRAM (and time).

I ended up rewriting the entire quantization tensor handling from 32 bit to 16 bit, "hacking" the cuBLAS function to split into a 16-bit cuBLAS path. When I was finished saving 50% of the VRAM, I noticed the tensor shape :-)
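A hedged sketch of what a 16-bit cuBLAS path can look like (assumes CUDA 11+ and densely packed column-major matrices; the actual ggllm.cpp kernels and buffer handling differ): operands are converted to FP16 on the device, then the GEMM runs with FP16 storage, so the staging buffers take roughly half the VRAM of the FP32 path.

#include <cublas_v2.h>
#include <cuda_fp16.h>

void gemm_f16(cublasHandle_t handle, const __half * A, const __half * B, __half * C,
              int m, int n, int k) {
    const __half alpha = __float2half(1.0f);
    const __half beta  = __float2half(0.0f);
    // A is m x k, B is k x n, C is m x n, all FP16, column-major, densely packed.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &alpha, A, CUDA_R_16F, m,
                         B, CUDA_R_16F, k,
                 &beta,  C, CUDA_R_16F, m,
                 CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT);
}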

Well, the good news is that the new kernels appear to work. cuBLAS operates in 16 bit now, and the VRAM buffer management is also optimized.
Going to push the changes in a bit, hopefully without breaking too much.

Ah, and benchmarks:
I am now getting up to 40 tokens/sec for single-GPU 7B generation of 1000+ tokens.
For 40B on a single GPU (K4 quants), almost 18 tokens/sec at the 1000-token mark.
And 55 and 24 tokens/sec respectively for short generations.

The PR is merged into master now.
Given the enormous amount of changes (new kernels, new cuBLAS path, new VRAM calculation, a shady tensor modification for Wizard, CUDA initialization in a background thread, new CUDA buffer management) I do expect a few crazy bugs.
Though I tested it on Windows and Linux with a variety of models and it appears to work.

Ah yes, "just one more little tweak" (turns into all night debugging session) seems to be an "emergent feature" of the large language models. ;-)

Anyhow, the only big slowdown I'm seeing now is during the context swap. After generating at a lightning pace it stops for multiple seconds, then resumes. It's not Falcon-related, just a side effect of llama.cpp's original "recompute half-the-context on swap" approach. I wonder whether this could be optimized away as well by modelling the kv cache memory as a ring buffer... but this would probably require some changes (at least) to the rope implementation as well to support "wrap-around" position indexes.
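A minimal sketch of the ring-buffer idea (not implemented in ggllm.cpp at this point; names are hypothetical): the cache slot is the absolute position modulo the context size, so old entries are overwritten in place instead of triggering a recompute-half-the-context swap. RoPE would still need the absolute, monotonically increasing position, which is the open question mentioned above.

struct RingKV {
    int n_ctx;                                                // number of cache slots
    // ... K/V storage for n_ctx slots per layer would live here ...
    int slot(int abs_pos) const { return abs_pos % n_ctx; }   // wrap-around index
};
// At absolute position pos, only keys with absolute positions in (pos - n_ctx, pos]
// are still resident and may be attended to.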

commented

That sounds like quite a cool idea, and probably also better for consistency, since the ring buffer changes gradually rather than doing an instantaneous swap.
For now I'm just using a large -c context to avoid the swapping, which ruins all metrics :)

commented

@jploski
After going through 50 loops I ended up where I started; it appears I tested it incorrectly the first time.
The broadcasting destroys batched processing once n_batch reaches 32 (in most cases; sometimes above or below 32).

I just pushed a hotfix which disables broadcasting during batching (back to repeating); it must be some nasty problem in the tensor shapes.
If it can be solved that would be great, though I think the repeat is less damaging in batched processing; it only makes full GPU integration a bit harder.

Here is a test case:
falcon_main -t 4 -m falcon-7b\q4_1 -p "Who are all the 5 people named? The first one is John and we have another one called Paul and then we have Nina and Alexa lastly there is the famous sportsman and pop cult singer Dudu! Answer:" -n 64 --temp 0 --override-max-gpu -b 31
-b 31 and below usually works; -b 32 and above fails.

@jploski After going through 50 loops I ended where I started, it appears I wrongly tested it first. The broadcasting destroys batched processing once n_batch reaches 32 (in most cases, sometimes > 32 or < 32).

I just pushed a hotfix which disabled broadcasting during batching (back to repeating), it must be some nasty problem in the tensor shapes. If it can be solved would be great, though I think the repeat is less damaging in the batched processing, only makes full GPU integration a bit harder.

Here is a test case: falcon_main -t 4 -m falcon-7b\q4_1 -p "Who are all the 5 people named? The first one is John and we have another one called Paul and then we have Nina and Alexa lastly there is the famous sportsman and pop cult singer Dudu! Answer:" -n 64 --temp 0 --override-max-gpu -b 31 <= -b 31 usually works, >= -b32 fails

What do you mean by "fails"?

I just tried the pre-hotfix commit d94c88d using a q5_1 version of 7B (and used --override-max-gpu 1 - is that what you meant?). It produced roughly the same output with -b 31, -b 32 and -b 33 (I think the differences are because of cuBLAS).

Edit: nevermind, I see the issue (garbage generated with large batch size)

commented

@jploski
Sorry, the "0" was cut off.
I meant overriding to 0, to prevent the layer offloading (which takes time for each test).

I ran a good 100 tests on various models and they always lose context with broadcasting. I tested 40B and 7B, 16 bit, normal and K quants; all show the same issue.

In that test case with falcon-7b/q4_1:
The normal response I get with temp=0 is that it repeats the whole prompt.
The response at -b >= 32 is none (end of text); with -b 31 it repeats the prompt.

./bin/falcon_main -m /root/q4_1 -n 512 -p "Who are all the 5 people named? The first one is John and we have another one called Paul and then we have Nina and Alexa lastly there is the famous sportsman and pop cult singer Dudu!" -n 64 -ngl 0 -n 500 -b 100 --temp 0
...
Who are all the 5 people named? The first one is John and we have another one called Paul and then we have Nina and Alexa lastly there is the famous sportsman and pop cult singer Dudu!<|endoftext|> [end of text]

falcon_print_timings:        load time =  1966.05 ms


./bin/falcon_main  -m /root/q4_1 -n 512 -p "Who are all the 5 people named? The first one is John and we have another one called Paul and then we have Nina and Alexa lastly there is the famous sportsman and pop cult singer Dudu!" -n 64  -ngl 0 -n 500 -b 10 --temp 0
...
Who are all the 5 people named? The first one is John and we have another one called Paul and then we have Nina and Alexa lastly there is the famous sportsman and pop cult singer Dudu!
Who are all the 5 people named? The first one is John and we have another one

I think I ran a hundred tests, almost all of them with the same problem: when the prompt is large enough and it's processed in batch mode, the whole context is lost. Sometimes I just got "DDDDDDDDDDD" as the result; usually it was language.

Edit: I see you noticed it. I was really worried it only happens locally :)
It is easy to miss because the output usually is good language and the prompt needs to exceed a certain length, so I missed it in all tests yesterday.

Edit: I see you noticed it. I was really worried it only happens locally :) It is easy to miss because the output usually is good language and the prompt needs to exceed length, so I missed it in all tests yesterday.

I can see it even with the falcon-mini-shakespeare model. It seems to depend not just on the batch size, but also on the prompt length. Specifically with prompt length of 32 tokens I see the problem appearing with -b 32 (not -b 31). But removing a single token from the prompt (making it 31 tokens) makes -b 32 (and higher) work again.

My test case:

bin/falcon_main -s 42 -t 1 -m /mnt/seagate/miniconda3/falcon/falcon/ggml-model-falcon-f32.bin -p "When we loop When we loop When we loop When we loop When we loop When we loop When we loop When we loop When we loop When we loop When we" -n 64 --temp 0 --override-max-gpu 0 --top_p 1 --top_k 1 -b 32
When we loop When we loop When we loop When we loop When we loop When we loop When we loop When we loop When we loop When we loop When we>>TITLE<< the the the my the be will to the theRYRY the the the the to to to to to the to to to to like the the in'' the the the that to be be be' the the the
commented

Edit: I see you noticed it. I was really worried it only happens locally :) It is easy to miss because the output usually is good language and the prompt needs to exceed length, so I missed it in all tests yesterday.

I can see it even with the falcon-mini-shakespeare model. It seems to depend not just on the batch size, but also on the prompt length. Specifically with prompt length of 32 tokens I see the problem appearing with -b 32 (not -b 31). But removing a single token from the prompt (making it 31 tokens) makes -b 32 (and higher) work again.

My test case:

bin/falcon_main -s 42 -t 1 -m /mnt/seagate/miniconda3/falcon/falcon/ggml-model-falcon-f32.bin -p "When we loop When we loop When we loop When we loop When we loop When we loop When we loop When we loop When we loop When we loop When we" -n 64 --temp 0 --override-max-gpu 0 --top_p 1 --top_k 1 -b 32
When we loop When we loop When we loop When we loop When we loop When we loop When we loop When we loop When we loop When we loop When we>>TITLE<< the the the my the be will to the theRYRY the the the the to to to to to the to to to to like the the in'' the the the that to be be be' the the the

Yes, it starts at -b 32. It's just that if your prompt is shorter than 32 tokens you cannot batch-process it with 32, so n_batch is automatically reduced to the prompt length.
It must be something about the tensor shapes that makes the broadcast index fail.
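For clarity, the clamping described above amounts to something like this (hypothetical helper): with a 31-token prompt, even -b 32 evaluates at most 31 tokens at once, so the faulty path is never reached.

#include <algorithm>

int effective_batch(int n_batch, int n_prompt_remaining) {
    return std::min(n_batch, n_prompt_remaining);   // a short prompt caps the batch size
}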

BTW I noticed Makefile generated with -DLLAMA_CUBLAS=0 does not compile, possibly another problem.

commented

BTW I noticed Makefile generated with -DLLAMA_CUBLAS=0 does not compile, possibly another problem.

Appears to work for me, can you show me the error ?

export LLAMA_CUBLAS=0
make

BTW I noticed Makefile generated with -DLLAMA_CUBLAS=0 does not compile, possibly another problem.

Appears to work for me, can you show me the error ?

export LLAMA_CUBLAS=0
make

Sorry, it also seems to have something to do with the Debug build type (it fails only with both options):

cmake .. -DLLAMA_CUBLAS=0 -DCMAKE_BUILD_TYPE=Debug
make
/tmp/ggllm.cpp/libfalcon.cpp: In function ‘void falcon_model_load_internal(const string&, falcon_context&, int, int, int, int, ggml_type, bool, bool, bool, falcon_progress_callback, void*)’:
/tmp/ggllm.cpp/libfalcon.cpp:1432:111: error: ‘system_gpu_status’ was not declared in this scope
 yer + vram_scratch + vram_reserved + vram_overhead + vram_output) > system_gpu_status->total_free_vram ?
commented

@jploski just pushed a bugfix for it

Edit: I see you noticed it. I was really worried it only happens locally :) It is easy to miss because the output usually is good language and the prompt needs to exceed length, so I missed it in all tests yesterday.

I can see it even with the falcon-mini-shakespeare model. It seems to depend not just on the batch size, but also on the prompt length. Specifically with prompt length of 32 tokens I see the problem appearing with -b 32 (not -b 31). But removing a single token from the prompt (making it 31 tokens) makes -b 32 (and higher) work again.
My test case:

bin/falcon_main -s 42 -t 1 -m /mnt/seagate/miniconda3/falcon/falcon/ggml-model-falcon-f32.bin -p "When we loop When we loop When we loop When we loop When we loop When we loop When we loop When we loop When we loop When we loop When we" -n 64 --temp 0 --override-max-gpu 0 --top_p 1 --top_k 1 -b 32
When we loop When we loop When we loop When we loop When we loop When we loop When we loop When we loop When we loop When we loop When we>>TITLE<< the the the my the be will to the theRYRY the the the the to to to to to the to to to to like the the in'' the the the that to be be be' the the the

Yes it's starting at "-b 32". Just if you have a prompt lower than 32 then you can not batch process it with 32, so that automatically reduces the n_batch down to the prompt length. It must be something about the tensor shapes that makes the broadcast index fail

I note it fails in the same way (judging by model output) even with the "interleaved" broadcast.

@jploski just pushed a bugfix for it

Thanks - I noticed that with -DLLAMA_CUBLAS=0 the -b 32 problem is not reproducible.

commented

@jploski just pushed a bugfix for it

Thanks - I noticed that with -DLLAMA_CUBLAS=0 the -b 32 problem is not reproducible.

daaaamn!
The cuda can_mul_mat has a "32" switch; it's just a small number Johannes chose as the minimal size.
And the GPU disabler does not seem to work reliably for n_batch.
Good find!

It does not really answer why this fails but at least we know where!
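Roughly the kind of gate being discussed (paraphrased from memory, not the exact ggml-cuda source): the cuBLAS offload path only engages once the batch dimension reaches 32, which is why -b 31 stays on the CPU path while -b 32 changes behaviour.

#include <cstdint>

static bool would_offload_mul_mat(int64_t n_rows_dst, int64_t n_batch_tokens) {
    return n_rows_dst >= 32 && n_batch_tokens >= 32;   // the "32" switch (threshold assumed)
}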

@jploski just pushed a bugfix for it

Thanks - I noticed that with -DLLAMA_CUBLAS=0 the -b 32 problem is not reproducible.

daaaamn ! The cuda can_mul_mat has a "32" switch, it's just a small number Johannes chose as minimal size. And the gpu disabler does not seem to work reliable for the n_batch Good find!

It does not really answer why this fails but at least we know where!

Yes, that might be it. I noticed that the mini-model fails with -DLLAMA_CUBLAS=1 even if I disable broadcasting and use ggml_repeat2 exclusively.

bin/falcon_main -s 42 -t 1 -m /mnt/seagate/miniconda3/falcon/falcon/ggml-model-falcon-f32.bin -p "When we loop When we loop When we loop When we loop When we loop When we loop When we loop When we loop When we loop When we loop When we" -n 64 --temp 1 --override-max-gpu 0 -ngl 0 --top_p 1 --top_k 1 -b 32

For the record, I also checked with valgrind/memcheck, which did not detect any illegal memory accesses.

To make valgrind work I had to use

cmake .. -DCMAKE_BUILD_TYPE=Debug -DLLAMA_NATIVE=OFF -DLLAMA_AVX2=OFF -DLLAMA_AVX=OFF -DLLAMA_F16C=OFF -DLLAMA_FMA=OFF -DLLAMA_AVX512=OFF -DLLAMA_AVX512_VBMI=OFF -DLLAMA_AVX512_VNNI=OFF
commented

Glad to hear about valgrind, I haven't used that in more than a decade. I was suspecting a problem of that sort.
The shakespeare model did run fine for me on CUDA with -b 1 so far (will test again).

I'm getting a ggml dequantizer segfault when adding return false; into ggml_cuda_compute_forward() (with the CUDA flag enabled).
That's super strange too. Putting "false" there should just disable the CUDA compute and do everything on the CPU (given that I additionally use -ngl 0).

commented

I just pushed an update patch.
I added functionality to force a tensor onto the CPU (or GPU), and when broadcasting is used, the K and KQ tensors are both flagged so they are not touched by the GPU. This can later be extended to only force them out of cuBLAS (once broadcasting is added to the vec_mul_mat kernels).
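A hypothetical sketch of such a per-tensor override (the real mechanism in ggllm.cpp may differ): K and KQ get tagged when broadcasting is used, and the offload decision consults the tag before handing the op to CUDA or cuBLAS.

enum backend_force { FORCE_NONE, FORCE_CPU, FORCE_GPU };

struct tensor_meta {
    backend_force force = FORCE_NONE;                // set to FORCE_CPU on K and KQ when broadcasting
};

static bool may_use_gpu(const tensor_meta & meta, bool gpu_available) {
    if (meta.force == FORCE_CPU) return false;       // never offload, not even via cuBLAS
    if (meta.force == FORCE_GPU) return true;
    return gpu_available;
}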

I also experimented with prompts of 10-500 tokens, and CPU-based broadcasting was always faster than repeat + cuBLAS.
I added an option to switch broadcasting off during n_batch processing for cuBLAS, but it's always off now.

Thanks for your debugging; I was not aware of the indirect jump into cuBLAS (in ggml.c that's a kind of unclean indirection that skips the whole ggml_cuda_operation loop). Now switching GPUs off really switches them off for all operations.