[ GEMM ] HGEMM noTrans case
skykongkong8 opened this issue
The latest NNTrainer HGEMM (as of 26.02.24) does not take the TLB into account.
I believe this is one of the most likely causes of the nonlinear latency growth in large GEMM computations.
I am currently implementing micro- and macro-scale HGEMM kernels to resolve this issue: skykongkong8@ac9cbb5
So far I am observing latency improvements without accuracy degradation (w.r.t. full-fp16 HGEMM).
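For reference, here is a minimal, hedged sketch of the kind of macro/micro-kernel structure this rework goes for: cache-level blocking plus panel packing so that the innermost kernel streams contiguous memory. The block sizes, tile shape, and function names are illustrative assumptions, not the actual code in the linked commit.

```cpp
// Minimal sketch of a cache/TLB-aware HGEMM for the row-major noTrans case,
// C += A * B. Panels of A and B are packed into small contiguous buffers so
// the micro kernel streams memory sequentially instead of striding across the
// whole matrix, which is what keeps cache and TLB misses down.
// Assumptions: AArch64 toolchain (__fp16), M/N/K multiples of the block sizes.
#include <vector>

using half = __fp16;

constexpr int MC = 64, NC = 64, KC = 128; // macro (cache-level) blocks, tunable
constexpr int MR = 8, NR = 8;             // micro tile computed per kernel call

// Micro kernel: an MR x NR tile of C += (packed A strip) * (packed B strip).
// A is packed MR values per k-step, B is packed NR values per k-step.
static void micro_kernel_8x8(int kc, const half *a, const half *b, half *c,
                             int ldc) {
  for (int p = 0; p < kc; ++p)
    for (int i = 0; i < MR; ++i)
      for (int j = 0; j < NR; ++j)
        c[i * ldc + j] += a[p * MR + i] * b[p * NR + j];
}

// Macro kernel: iterate over cache blocks, pack panels, call the micro kernel.
void hgemm_noTrans(int M, int N, int K, const half *A, const half *B, half *C) {
  std::vector<half> packA(MC * KC), packB(KC * NC);
  for (int jc = 0; jc < N; jc += NC)
    for (int pc = 0; pc < K; pc += KC) {
      // Pack a KC x NC panel of B so that each NR-wide strip is contiguous.
      for (int j = 0; j < NC; j += NR)
        for (int p = 0; p < KC; ++p)
          for (int jj = 0; jj < NR; ++jj)
            packB[(j / NR) * KC * NR + p * NR + jj] =
              B[(pc + p) * N + jc + j + jj];
      for (int ic = 0; ic < M; ic += MC) {
        // Pack an MC x KC panel of A so that each MR-high strip is contiguous.
        for (int i = 0; i < MC; i += MR)
          for (int p = 0; p < KC; ++p)
            for (int ii = 0; ii < MR; ++ii)
              packA[(i / MR) * KC * MR + p * MR + ii] =
                A[(ic + i + ii) * K + pc + p];
        // Sweep the micro tiles of C that belong to the current block.
        for (int i = 0; i < MC; i += MR)
          for (int j = 0; j < NC; j += NR)
            micro_kernel_8x8(KC, &packA[(i / MR) * KC * MR],
                             &packB[(j / NR) * KC * NR],
                             &C[(ic + i) * N + jc + j], N);
      }
    }
}
```

In the real kernels the innermost loops are written with NEON intrinsics and the accumulators stay in registers; the sketch is only meant to show the loop ordering and the packing.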
[ outdated ]
- v5 : apply SIMD to the brute-force implementation
- v6 : v5 + loop unrolling
- v7 : v6 + bigger kernel
- v8 : v7 + cache blocking
- v9 : v8 + packing
- v10 : v9 + bigger kernel + adaptive kernel use
- v11 : v10 + discontinuous packing on B
- [ WIP ] v10_modified : v10 + small kernels + adaptive data packing
[ current status ] : #2531, #2541, #2578
- continuous data packing
- 4x4, 4x8, and 8x8 hgemm kernels for f16 and f16-f32
- software prefetching (a minimal sketch follows right after this list)
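For the software-prefetching item, below is a hedged sketch of what it can look like inside a packed inner loop. `__builtin_prefetch` is the standard GCC/Clang builtin; the prefetch distance, tile shape, and function name are illustrative assumptions rather than the actual NNTrainer code.

```cpp
// Software prefetching inside a packed 8x8 inner loop (illustrative).
// The prefetch pulls panel data PF_DIST k-steps ahead into cache while the
// current multiply-accumulate work is still in flight. Prefetching past the
// end of the buffers is harmless: the hint never faults.
#include <cstddef>

using half = __fp16; // assumes an AArch64 toolchain

void inner_loop_with_prefetch(std::size_t kc, const half *packed_a,
                              const half *packed_b, float *acc /* 8x8 tile */) {
  constexpr std::size_t PF_DIST = 64; // k-steps ahead to prefetch, tunable
  for (std::size_t p = 0; p < kc; ++p) {
    __builtin_prefetch(packed_a + (p + PF_DIST) * 8, /*rw=*/0, /*locality=*/3);
    __builtin_prefetch(packed_b + (p + PF_DIST) * 8, /*rw=*/0, /*locality=*/3);
    for (int i = 0; i < 8; ++i)
      for (int j = 0; j < 8; ++j)
        acc[i * 8 + j] += static_cast<float>(packed_a[p * 8 + i]) *
                          static_cast<float>(packed_b[p * 8 + j]);
  }
}
```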
If you would like to comment on or review the WIP branch, please leave feedback here, or let me know! :)
@s-debadri As we discussed, please get started from here :) Thanks a lot!
Current Status : 08.04.2024
Unit test output on a Galaxy S23 with #2541
GEMM dimension | fp32 | prev | 8x8 | f16-f32 8x16 | full-f16 |
---|---|---|---|---|---|
4096 square | 2087 ms | 7172 ms | ... | 1964 ms | 1452 ms |
2048 square | 260 ms | 413 ms | ... | 250 ms | 185 ms |
1024 square | 34 ms | 52 ms | ... | 30 ms | 103 ms |
768 square | 13 ms | 18 ms | ... | 11 ms | 10 ms |
256x1440x256 | 2869 µs | 3807 µs | ... | 2544 µs | 2055 µs |
256x256x1440 | 2929 µs | 3950 µs | ... | 2467 µs | 2523 µs |
8x1440x8 | 5 µs | 5 µs | ... | 10 µs | |
8x8x1440 | 5 µs | 4 µs | ... | 8 µs | |
Status Update: 24.04.2024
- Macro-style kernels (e.g. the KERNEL_8x16_ACC16 / KERNEL_8x16_ACC8 variants below)
- Adaptive loops for the macro kernels
- More values processed per loop iteration (a hedged structural sketch follows this list)
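A hedged structural sketch of these points is below: the innermost k-step is written once as a preprocessor macro, stamped out several times per loop iteration, and an adaptive tail handles the leftover k-steps. The macro name, 1x8 strip shape, and fp32 accumulation are illustrative assumptions; the actual KERNEL_8x16_ACC* kernels are larger and handle the fp16 partial accumulation differently (see the second sketch further below).

```cpp
// Macro-style kernel sketch (illustrative, not the actual NNTrainer macros).
// HGEMM_STEP is one k-step of a 1x8 output strip; the unrolled loop stamps it
// out 8 times per iteration, and the tail loop handles leftover k-steps.
#include <cstddef>

using half = __fp16; // assumes an AArch64 toolchain

#define HGEMM_STEP(acc, a, b, p)                                        \
  for (int j = 0; j < 8; ++j)                                           \
    (acc)[j] += static_cast<float>((a)[(p)]) *                          \
                static_cast<float>((b)[(p) * 8 + j]);

void strip_1x8(std::size_t kc, const half *a, const half *b, float *acc) {
  std::size_t p = 0;
  for (; p + 8 <= kc; p += 8) { // 8 stamped-out k-steps per loop iteration
    HGEMM_STEP(acc, a, b, p + 0); HGEMM_STEP(acc, a, b, p + 1);
    HGEMM_STEP(acc, a, b, p + 2); HGEMM_STEP(acc, a, b, p + 3);
    HGEMM_STEP(acc, a, b, p + 4); HGEMM_STEP(acc, a, b, p + 5);
    HGEMM_STEP(acc, a, b, p + 6); HGEMM_STEP(acc, a, b, p + 7);
  }
  for (; p < kc; ++p) // adaptive tail for dimensions not divisible by 8
    HGEMM_STEP(acc, a, b, p);
}
```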
Unit test output on a Galaxy S23 with a local commit (TBA)
Latency
mean latency over TC = 100 runs
dim | KERNEL_8x16_ACC16 | KERNEL_8x16_ACC8 | cblas fp32 |
---|---|---|---|
1024 | 23 ms | 30 ms | 32 ms |
768 | 9 ms | 12.8 ms | 13.6 ms |
256x1440x256 | 2054 µs | 2664 µs | 2701 µs |
256x256x1440 | 2359 µs | 2965 µs | 3104 µs |
MSE w.r.t. SGEMM
dim | KERNEL_8x16_ACC16 | KERNEL_8x16_ACC8 |
---|---|---|
1024 | 0.00608169 | 0.00226737 |
768 | 0.00310214 | 0.0017091 |
256x1440x256 | 0.0149112 | 0.00518965 |
256x256x1440 | 0.00119428 | 0.000306849 |
- Overall, the f16-f32 kernels show roughly a 1.3x to 1.5x speedup w.r.t. cblas fp32
- Considering the doubled vector length when moving from f32 to f16 (8 lanes instead of 4 per 128-bit NEON register) and the partial accumulation scheme, the result above is reasonable (see the sketch below)
- However, this code trades some accuracy for that speed; it should be checked once more against actual model output
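As a minimal illustration of those two points, the hedged sketch below (assumed NEON intrinsics path, ARMv8.2-A FP16 target, illustrative names and `ACC` parameter) multiplies in fp16, so each 128-bit register carries 8 lanes instead of the 4 lanes of fp32, and folds the fp16 partial sums into fp32 accumulators every `ACC` k-steps to bound the rounding error. In such a scheme a smaller accumulation window folds more often, trading a little speed for accuracy, which is consistent with the ACC8 vs ACC16 columns above.

```cpp
// f16 multiply with periodic f32 accumulation (illustrative sketch).
// Build with e.g. -march=armv8.2-a+fp16. Assumes K is a multiple of ACC and
// b_row holds 8 packed fp16 values per k-step.
#include <arm_neon.h>

void accumulate_strip_1x8(int K, int ACC, const __fp16 *a_col,
                          const __fp16 *b_row, float32x4_t &sum_lo,
                          float32x4_t &sum_hi) {
  for (int k = 0; k < K; k += ACC) {
    float16x8_t part = vdupq_n_f16(0.f); // fp16 partial sums (fast, 8 lanes)
    for (int p = 0; p < ACC; ++p) {
      float16x8_t b = vld1q_f16(b_row + (k + p) * 8);
      part = vfmaq_n_f16(part, b, a_col[k + p]); // 8 fp16 FMAs per intrinsic
    }
    // Fold into fp32 every ACC steps to keep fp16 rounding error bounded.
    sum_lo = vaddq_f32(sum_lo, vcvt_f32_f16(vget_low_f16(part)));
    sum_hi = vaddq_f32(sum_hi, vcvt_f32_f16(vget_high_f16(part)));
  }
}
```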
This issue is temporarily resolved; follow-up work can be discussed in other issues.