nnstreamer / nntrainer

NNTrainer is a software framework for training neural network models on devices.

[ GEMM ] HGEMM noTrans case

skykongkong8 opened this issue

The latest NNTrainer HGEMM (as of 26.02.24) does not take the TLB cache into account.
I believe this is one of the most likely causes of the nonlinear latency we see in large GEMM computations.
I am currently implementing micro- and macro-scale HGEMM kernels to resolve this issue: skykongkong8@ac9cbb5
So far I am observing improvements in latency without any deterioration of accuracy (w.r.t. full-fp16 HGEMM).
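
For context, this is the usual GotoBLAS-style structure such kernels follow: pack panels of A and B so the inner kernels read contiguous memory, and size the B panel so its working set stays TLB-friendly. The sketch below only illustrates that structure; the block sizes, row-major layout, and scalar macro kernel are my illustrative assumptions, not NNTrainer's actual code.

```cpp
// Minimal sketch of GotoBLAS-style cache blocking with packed panels for
// HGEMM (C += A * B, all row-major, noTrans). Block sizes, layout, and the
// scalar macro kernel are illustrative assumptions, not the actual NNTrainer code.
#include <algorithm>
#include <cstddef>
#include <vector>

using fp16 = __fp16; // compiler-provided half type on AArch64

constexpr size_t MC = 64;  // rows of A packed per iteration (kept L2-resident)
constexpr size_t KC = 256; // shared-dimension depth per iteration (kept L1-resident)
constexpr size_t NC = 512; // columns of B packed per iteration (bounds the TLB working set)

void hgemm_blocked(size_t M, size_t N, size_t K,
                   const fp16 *A, const fp16 *B, fp16 *C) {
  std::vector<fp16> packA(MC * KC), packB(KC * NC);
  for (size_t jc = 0; jc < N; jc += NC) {
    const size_t nb = std::min(NC, N - jc);
    for (size_t pc = 0; pc < K; pc += KC) {
      const size_t kb = std::min(KC, K - pc);
      // Pack a kb x nb panel of B contiguously so the kernels walk a single stream.
      for (size_t p = 0; p < kb; ++p)
        for (size_t j = 0; j < nb; ++j)
          packB[p * nb + j] = B[(pc + p) * N + (jc + j)];
      for (size_t ic = 0; ic < M; ic += MC) {
        const size_t mb = std::min(MC, M - ic);
        // Pack an mb x kb panel of A.
        for (size_t i = 0; i < mb; ++i)
          for (size_t p = 0; p < kb; ++p)
            packA[i * kb + p] = A[(ic + i) * K + (pc + p)];
        // Macro kernel: plain scalar loops stand in for the NEON micro-kernels.
        for (size_t i = 0; i < mb; ++i)
          for (size_t j = 0; j < nb; ++j) {
            float acc = static_cast<float>(C[(ic + i) * N + (jc + j)]);
            for (size_t p = 0; p < kb; ++p)
              acc += static_cast<float>(packA[i * kb + p]) *
                     static_cast<float>(packB[p * nb + j]);
            C[(ic + i) * N + (jc + j)] = static_cast<fp16>(acc);
          }
      }
    }
  }
}
```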

[ outdated ]

  • v5 : apply SIMD in Brute Force implementation
  • v6 : v5 + loop unrolling
  • v7 : v6 + bigger kernel
  • v8 : v7 + cache blocking
  • v9 : v8 + packing
  • v10 : v9 + bigger kernel + adaptive kernel use
  • v11 : v10 + discontinuous packing on B
  • [ WIP ] v10_modified : v10 + small kernels + adaptive data packing

[ current status ] : #2531 , #2541 , #2578

  1. continuous data packing
  2. 4x4, 4x8, and 8x8 HGEMM kernels for f16 and f16-f32
  3. software prefetching (a combined micro-kernel sketch follows below)
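
To show how the three items above fit together at the micro-kernel level, here is a rough sketch of a small full-fp16 kernel that consumes packed panels and issues software prefetches. The 4x8 tile, prefetch distance, and function name are placeholders for illustration; the actual 4x4 / 4x8 / 8x8 kernels are in the linked PRs.

```cpp
// Rough sketch of a 4x8 full-fp16 micro-kernel working on packed panels with
// software prefetching. Tile size, prefetch distance, and names are illustrative;
// the real 4x4 / 4x8 / 8x8 kernels live in the linked PRs.
// Requires an ARMv8.2-A target with fp16 arithmetic (e.g. -march=armv8.2-a+fp16).
#include <arm_neon.h>
#include <cstddef>

// Ap: packed 4 x K panel of A, 4 consecutive row values per k step
// Bp: packed K x 8 panel of B, 8 values per k step
// C : row-major output tile, ldc = leading dimension
static inline void hgemm_kernel_4x8_f16(size_t K, const __fp16 *Ap,
                                        const __fp16 *Bp, __fp16 *C, size_t ldc) {
  float16x8_t c0 = vld1q_f16(C + 0 * ldc);
  float16x8_t c1 = vld1q_f16(C + 1 * ldc);
  float16x8_t c2 = vld1q_f16(C + 2 * ldc);
  float16x8_t c3 = vld1q_f16(C + 3 * ldc);

  for (size_t k = 0; k < K; ++k) {
    // Prefetch the packed panels ~16 k-steps ahead so they stream through L1.
    __builtin_prefetch(Ap + 4 * 16, 0, 3);
    __builtin_prefetch(Bp + 8 * 16, 0, 3);

    float16x8_t b = vld1q_f16(Bp);             // one packed row of B (8 fp16 values)
    c0 = vfmaq_f16(c0, b, vdupq_n_f16(Ap[0])); // row 0: c0 += b * A[0][k]
    c1 = vfmaq_f16(c1, b, vdupq_n_f16(Ap[1]));
    c2 = vfmaq_f16(c2, b, vdupq_n_f16(Ap[2]));
    c3 = vfmaq_f16(c3, b, vdupq_n_f16(Ap[3]));
    Ap += 4;
    Bp += 8;
  }

  vst1q_f16(C + 0 * ldc, c0);
  vst1q_f16(C + 1 * ldc, c1);
  vst1q_f16(C + 2 * ldc, c2);
  vst1q_f16(C + 3 * ldc, c3);
}
```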

:octocat: cibot: Thank you for posting issue #2488. The person in charge will reply soon.

If you would like to comment on or review the WIP branch, please leave it here or let me know! :)

@s-debadri As we discussed, please get started from here :) Thanks a lot!

Current Status : 08.04.2024

Unittest output using Galaxy S23 with #2541

| GEMM dimension | fp32 | prev | 8x8 f16-f32 | 8x16 | full-f16 |
|---|---|---|---|---|---|
| 4096 square | 2087 ms | 7172 ms | ... | 1964 ms | 1452 ms |
| 2048 square | 260 ms | 413 ms | ... | 250 ms | 185 ms |
| 1024 square | 34 ms | 52 ms | ... | 30 ms | 103 ms |
| 768 square | 13 ms | 18 ms | ... | 11 ms | 10 ms |
| 256X1440X256 | 2869 µs | 3807 µs | ... | 2544 µs | 2055 µs |
| 256X256X1440 | 2929 µs | 3950 µs | ... | 2467 µs | 2523 µs |
| 8X1440X8 | 5 µs | 5 µs | ... | 10 µs | |
| 8X8X1440 | 5 µs | 4 µs | ... | 8 µs | |

Status Update: 24.04.2024

  • Macro-style kernel
  • Adaptive loops for macros (a dispatch sketch follows below)
  • More digits per loop
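
To illustrate what the macro-style kernel with adaptive loops means: the macro loop covers each output stripe with the widest tile that still fits, then falls back to narrower tiles for the remainder. The sketch below uses a scalar stand-in for the tile update and illustrative names; it is not the actual dispatch code.

```cpp
// Minimal sketch of an adaptive macro loop: cover the output with the widest
// tile that still fits, then fall back to narrower tiles for the remainder.
// The scalar tile update, names, and B layout are illustrative only.
#include <cstddef>

using fp16 = __fp16; // compiler-provided half type on AArch64

// Scalar stand-in for one 8 x w tile update: C[0..7][0..w-1] += Ap * Bp.
// Ap: packed 8 x K panel (8 row values per k step), Bp: K x N panel, row-major.
static void tile_8xw(size_t K, const fp16 *Ap, const fp16 *Bp, size_t ldb,
                     fp16 *C, size_t ldc, size_t w) {
  for (size_t i = 0; i < 8; ++i)
    for (size_t j = 0; j < w; ++j) {
      float acc = static_cast<float>(C[i * ldc + j]);
      for (size_t k = 0; k < K; ++k)
        acc += static_cast<float>(Ap[k * 8 + i]) * static_cast<float>(Bp[k * ldb + j]);
      C[i * ldc + j] = static_cast<fp16>(acc);
    }
}

// Adaptive dispatch over one 8-row stripe of C: in the real code, the 8x16 and
// 8x8 branches would call the NEON micro-kernels instead of the scalar stand-in.
void hgemm_macro_row(size_t N, size_t K, const fp16 *Ap, const fp16 *Bp,
                     fp16 *C, size_t ldc) {
  size_t j = 0;
  for (; j + 16 <= N; j += 16)          // widest kernel while a full 8x16 tile fits
    tile_8xw(K, Ap, Bp + j, N, C + j, ldc, 16);
  for (; j + 8 <= N; j += 8)            // 8x8 kernel for a half-width remainder
    tile_8xw(K, Ap, Bp + j, N, C + j, ldc, 8);
  if (j < N)                            // narrow edge case
    tile_8xw(K, Ap, Bp + j, N, C + j, ldc, N - j);
}
```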

Unittest output using Galaxy S23 with local commit (TBA)

Latency

mean latency with TC = 100

| dim | KERNEL_8x16_ACC16 | KERNEL_8x16_ACC8 | cblas fp32 |
|---|---|---|---|
| 1024 | 23 ms | 30 ms | 32 ms |
| 768 | 9 ms | 12.8 ms | 13.6 ms |
| 256x1440x256 | 2054 µs | 2664 µs | 2701 µs |
| 256x256x1440 | 2359 µs | 2965 µs | 3104 µs |

MSE w.r.t. sgemm

| dim | KERNEL_8x16_ACC16 | KERNEL_8x16_ACC8 |
|---|---|---|
| 1024 | 0.00608169 | 0.00226737 |
| 768 | 0.00310214 | 0.0017091 |
| 256x1440x256 | 0.0149112 | 0.00518965 |
| 256x256x1440 | 0.00119428 | 0.000306849 |

  • Overall, this shows roughly a 1.5x speed-up with f16-f32 w.r.t. cblas fp32.
  • Considering the doubled SIMD vector length when moving from f32 to f16, together with partial accumulation, the results above seem reasonable (a scalar sketch of the partial-accumulation trade-off follows below).
  • However, this code pays for the latency gain with a small loss of accuracy. This should be checked once more against actual model output.
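
For reference on the accuracy/latency trade-off: my reading of the ACC8 / ACC16 suffixes (an assumption on my part) is the number of fp16 accumulation steps taken before the partial sum is widened into fp32, so deeper fp16 accumulation saves conversions (lower latency) at the cost of more fp16 rounding error, which is consistent with the latency and MSE tables above. A minimal scalar sketch of that knob, with illustrative names:

```cpp
// Scalar sketch of partial accumulation for one dot product of length K:
// the running sum is kept in (and rounded back to) fp16 for ACC steps, then
// flushed into an fp32 accumulator. Larger ACC means fewer fp32 conversions
// (faster) but more fp16 rounding error. Names are illustrative only.
#include <cstddef>

using fp16 = __fp16; // compiler-provided half type on AArch64

template <size_t ACC>
float dot_partial_acc(size_t K, const fp16 *a, const fp16 *b) {
  float sum32 = 0.0f;
  size_t k = 0;
  while (k < K) {
    fp16 sum16 = static_cast<fp16>(0.0f);
    const size_t end = (k + ACC < K) ? k + ACC : K;
    for (; k < end; ++k)
      sum16 = static_cast<fp16>(sum16 + a[k] * b[k]); // partial sum rounded to fp16 each step
    sum32 += static_cast<float>(sum16);               // widen into fp32 every ACC steps
  }
  return sum32;
}

// dot_partial_acc<16>(...) mirrors the ACC16 behaviour (faster, larger MSE),
// dot_partial_acc<8>(...)  mirrors the ACC8 behaviour (slower, smaller MSE).
```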

This issue is temporarily resolved; follow-up work can be discussed in separate issues.