[ GEMM ] HGEMM noTrans case
skykongkong8 opened this issue
The latest NNTrainer HGEMM (as of 26.02.24) does not take the TLB into account.
I believe this is one of the most likely causes of the nonlinear latency growth in large GEMM computations.
I am currently implementing micro- and macro-scale HGEMM kernels to resolve this issue: skykongkong8@ac9cbb5
So far I am observing latency improvements without accuracy degradation (w.r.t. full-fp16 HGEMM).
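For reference, here is a minimal, hedged sketch of the kind of macro/micro-kernel structure this rework goes for: cache-level blocking plus panel packing so that the innermost kernel streams contiguous memory. The block sizes, tile shape, and function names are illustrative assumptions, not the actual code in the linked commit.

```cpp
// Minimal sketch of a cache/TLB-aware HGEMM for the row-major noTrans case,
// C += A * B. Panels of A and B are packed into small contiguous buffers so
// the micro kernel streams memory sequentially instead of striding across the
// whole matrix, which is what keeps cache and TLB misses down.
// Assumptions: AArch64 toolchain (__fp16), M/N/K multiples of the block sizes.
#include <vector>

using half = __fp16;

constexpr int MC = 64, NC = 64, KC = 128; // macro (cache-level) blocks, tunable
constexpr int MR = 8, NR = 8;             // micro tile computed per kernel call

// Micro kernel: an MR x NR tile of C += (packed A strip) * (packed B strip).
// A is packed MR values per k-step, B is packed NR values per k-step.
static void micro_kernel_8x8(int kc, const half *a, const half *b, half *c,
                             int ldc) {
  for (int p = 0; p < kc; ++p)
    for (int i = 0; i < MR; ++i)
      for (int j = 0; j < NR; ++j)
        c[i * ldc + j] += a[p * MR + i] * b[p * NR + j];
}

// Macro kernel: iterate over cache blocks, pack panels, call the micro kernel.
void hgemm_noTrans(int M, int N, int K, const half *A, const half *B, half *C) {
  std::vector<half> packA(MC * KC), packB(KC * NC);
  for (int jc = 0; jc < N; jc += NC)
    for (int pc = 0; pc < K; pc += KC) {
      // Pack a KC x NC panel of B so that each NR-wide strip is contiguous.
      for (int j = 0; j < NC; j += NR)
        for (int p = 0; p < KC; ++p)
          for (int jj = 0; jj < NR; ++jj)
            packB[(j / NR) * KC * NR + p * NR + jj] =
              B[(pc + p) * N + jc + j + jj];
      for (int ic = 0; ic < M; ic += MC) {
        // Pack an MC x KC panel of A so that each MR-high strip is contiguous.
        for (int i = 0; i < MC; i += MR)
          for (int p = 0; p < KC; ++p)
            for (int ii = 0; ii < MR; ++ii)
              packA[(i / MR) * KC * MR + p * MR + ii] =
                A[(ic + i + ii) * K + pc + p];
        // Sweep the micro tiles of C that belong to the current block.
        for (int i = 0; i < MC; i += MR)
          for (int j = 0; j < NC; j += NR)
            micro_kernel_8x8(KC, &packA[(i / MR) * KC * MR],
                             &packB[(j / NR) * KC * NR],
                             &C[(ic + i) * N + jc + j], N);
      }
    }
}
```

In the real kernels the innermost loops are written with NEON intrinsics and the accumulators stay in registers; the sketch is only meant to show the loop ordering and the packing.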
[ outdated ]
- v5 : apply SIMD to the brute-force implementation
- v6 : v5 + loop unrolling
- v7 : v6 + bigger kernel
- v8 : v7 + cache blocking
- v9 : v8 + packing
- v10 : v9 + bigger kernel + adaptive kernel use
- v11 : v10 + discontinuous packing on B
- [ WIP ] v10_modified : v10 + small kernels + adaptive data packing
[ current status ] : #2531, #2541, #2578
- continuous data packing
- 4x4, 4x8, and 8x8 hgemm kernels for f16 and f16-f32
- software prefetching (a minimal sketch follows right after this list)
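For the software-prefetching item, below is a hedged sketch of what it can look like inside a packed inner loop. `__builtin_prefetch` is the standard GCC/Clang builtin; the prefetch distance, tile shape, and function name are illustrative assumptions rather than the actual NNTrainer code.

```cpp
// Software prefetching inside a packed 8x8 inner loop (illustrative).
// The prefetch pulls panel data PF_DIST k-steps ahead into cache while the
// current multiply-accumulate work is still in flight. Prefetching past the
// end of the buffers is harmless: the hint never faults.
#include <cstddef>

using half = __fp16; // assumes an AArch64 toolchain

void inner_loop_with_prefetch(std::size_t kc, const half *packed_a,
                              const half *packed_b, float *acc /* 8x8 tile */) {
  constexpr std::size_t PF_DIST = 64; // k-steps ahead to prefetch, tunable
  for (std::size_t p = 0; p < kc; ++p) {
    __builtin_prefetch(packed_a + (p + PF_DIST) * 8, /*rw=*/0, /*locality=*/3);
    __builtin_prefetch(packed_b + (p + PF_DIST) * 8, /*rw=*/0, /*locality=*/3);
    for (int i = 0; i < 8; ++i)
      for (int j = 0; j < 8; ++j)
        acc[i * 8 + j] += static_cast<float>(packed_a[p * 8 + i]) *
                          static_cast<float>(packed_b[p * 8 + j]);
  }
}
```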
If you would like to comment on or review the WIP branch, please leave feedback here, or let me know! :)
@s-debadri As we discussed, please get started from here :) Thanks a lot!
Current Status : 08.04.2024
Unit test output on a Galaxy S23 with #2541
GEMM dimension | fp32 | prev | 8x8 | f16-f32 8x16 | full-f16 |
---|---|---|---|---|---|
4096 square | 2087 ms | 7172 ms | ... | 1964 ms | 1452 ms |
2048 square | 260 ms | 413 ms | ... | 250 ms | 185 ms |
1024 square | 34 ms | 52 ms | ... | 30 ms | 103 ms |
768 square | 13 ms | 18 ms | ... | 11 ms | 10 ms |
256x1440x256 | 2869 µs | 3807 µs | ... | 2544 µs | 2055 µs |
256x256x1440 | 2929 µs | 3950 µs | ... | 2467 µs | 2523 µs |
8x1440x8 | 5 µs | 5 µs | ... | 10 µs | |
8x8x1440 | 5 µs | 4 µs | ... | 8 µs | |
Status Update: 24.04.2024
- Macro-style kernels (e.g. the KERNEL_8x16_ACC16 / KERNEL_8x16_ACC8 variants below)
- Adaptive loops for the macro kernels
- More values processed per loop iteration (a hedged structural sketch follows this list)
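A hedged structural sketch of these points is below: the innermost k-step is written once as a preprocessor macro, stamped out several times per loop iteration, and an adaptive tail handles the leftover k-steps. The macro name, 1x8 strip shape, and fp32 accumulation are illustrative assumptions; the actual KERNEL_8x16_ACC* kernels are larger and handle the fp16 partial accumulation differently (see the second sketch further below).

```cpp
// Macro-style kernel sketch (illustrative, not the actual NNTrainer macros).
// HGEMM_STEP is one k-step of a 1x8 output strip; the unrolled loop stamps it
// out 8 times per iteration, and the tail loop handles leftover k-steps.
#include <cstddef>

using half = __fp16; // assumes an AArch64 toolchain

#define HGEMM_STEP(acc, a, b, p)                                        \
  for (int j = 0; j < 8; ++j)                                           \
    (acc)[j] += static_cast<float>((a)[(p)]) *                          \
                static_cast<float>((b)[(p) * 8 + j]);

void strip_1x8(std::size_t kc, const half *a, const half *b, float *acc) {
  std::size_t p = 0;
  for (; p + 8 <= kc; p += 8) { // 8 stamped-out k-steps per loop iteration
    HGEMM_STEP(acc, a, b, p + 0); HGEMM_STEP(acc, a, b, p + 1);
    HGEMM_STEP(acc, a, b, p + 2); HGEMM_STEP(acc, a, b, p + 3);
    HGEMM_STEP(acc, a, b, p + 4); HGEMM_STEP(acc, a, b, p + 5);
    HGEMM_STEP(acc, a, b, p + 6); HGEMM_STEP(acc, a, b, p + 7);
  }
  for (; p < kc; ++p) // adaptive tail for dimensions not divisible by 8
    HGEMM_STEP(acc, a, b, p);
}
```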
Unit test output on a Galaxy S23 with a local commit (TBA)
Latency
mean latency over TC = 100 runs
dim | KERNEL_8x16_ACC16 | KERNEL_8x16_ACC8 | cblas fp32 |
---|---|---|---|
1024 | 23 ms | 30 ms | 32 ms |
768 | 9 ms | 12.8 ms | 13.6 ms |
256x1440x256 | 2054 µs | 2664 µs | 2701 µs |
256x256x1440 | 2359 µs | 2965 µs | 3104 µs |
MSE w.r.t. SGEMM
dim | KERNEL_8x16_ACC16 | KERNEL_8x16_ACC8 |
---|---|---|
1024 | 0.00608169 | 0.00226737 |
768 | 0.00310214 | 0.0017091 |
256x1440x256 | 0.0149112 | 0.00518965 |
256x256x1440 | 0.00119428 | 0.000306849 |
- Overall, the f16-f32 kernels show roughly a 1.3x to 1.5x speedup w.r.t. cblas fp32
- Considering the doubled vector length when moving from f32 to f16 (8 lanes instead of 4 per 128-bit NEON register) and the partial accumulation scheme, the result above is reasonable (see the sketch below)
- However, this code trades some accuracy for that speed; it should be checked once more against actual model output
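As a minimal illustration of those two points, the hedged sketch below (assumed NEON intrinsics path, ARMv8.2-A FP16 target, illustrative names and `ACC` parameter) multiplies in fp16, so each 128-bit register carries 8 lanes instead of the 4 lanes of fp32, and folds the fp16 partial sums into fp32 accumulators every `ACC` k-steps to bound the rounding error. In such a scheme a smaller accumulation window folds more often, trading a little speed for accuracy, which is consistent with the ACC8 vs ACC16 columns above.

```cpp
// f16 multiply with periodic f32 accumulation (illustrative sketch).
// Build with e.g. -march=armv8.2-a+fp16. Assumes K is a multiple of ACC and
// b_row holds 8 packed fp16 values per k-step.
#include <arm_neon.h>

void accumulate_strip_1x8(int K, int ACC, const __fp16 *a_col,
                          const __fp16 *b_row, float32x4_t &sum_lo,
                          float32x4_t &sum_hi) {
  for (int k = 0; k < K; k += ACC) {
    float16x8_t part = vdupq_n_f16(0.f); // fp16 partial sums (fast, 8 lanes)
    for (int p = 0; p < ACC; ++p) {
      float16x8_t b = vld1q_f16(b_row + (k + p) * 8);
      part = vfmaq_n_f16(part, b, a_col[k + p]); // 8 fp16 FMAs per intrinsic
    }
    // Fold into fp32 every ACC steps to keep fp16 rounding error bounded.
    sum_lo = vaddq_f32(sum_lo, vcvt_f32_f16(vget_low_f16(part)));
    sum_hi = vaddq_f32(sum_hi, vcvt_f32_f16(vget_high_f16(part)));
  }
}
```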
This issue is temporarily resolved; follow-up work can be discussed in other issues.