nnstreamer / nntrainer

NNTrainer is a software framework for training neural network models on devices.

[ HGEMM ] Half-Precision GEMM Roadmap

skykongkong8 opened this issue · comments

1. Objective

The aim of this project is to implement an optimal half-precision GEMM running on armv8.2-A using NEON.

2. Roadmap

Consider a GEMM of the form

$$A( M , K ) * B( K , N ) = C( M , N )$$

Step1. Vanilla HGEMM

  • vanilla implementation of half-precision GEMM with NEON
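The Step 1 baseline could look like the scalar sketch below (hypothetical `hgemm_naive`, not nntrainer's actual API; `float` stands in for `__fp16` so the sketch compiles off-device, whereas the real kernel would operate on NEON `float16x8_t` vectors):

```cpp
#include <cstddef>

// Naive reference GEMM: C(M, N) = A(M, K) * B(K, N), row-major.
// float stands in for __fp16 here; the NEON version would replace the
// inner loop with float16x8_t loads and vfmaq_f16 accumulation.
void hgemm_naive(const float *A, const float *B, float *C,
                 size_t M, size_t N, size_t K) {
  for (size_t m = 0; m < M; ++m) {
    for (size_t n = 0; n < N; ++n) {
      float acc = 0.f;
      for (size_t k = 0; k < K; ++k)
        acc += A[m * K + k] * B[k * N + n];
      C[m * N + n] = acc;
    }
  }
}
```

This triple loop is the correctness oracle the later kernel-based versions can be tested against.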

Step2. Kernel-based HGEMM

  • GEMM with no transpose : A * B = C
  • GEMM with transpose A : A.T * B = C
  • GEMM with transpose B : A * B.T = C
  • GEMM with transpose AB : A.T * B.T = C
  • GEMM with scale (alpha, beta) : C = C * beta + A * ( alpha * B )
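The transpose and scale variants listed above can all be expressed through one reference routine (hypothetical `hgemm_scaled`, not nntrainer's actual signature; `float` again stands in for `__fp16`):

```cpp
#include <cstddef>

// Scaled GEMM with optional transposes:
//   C = alpha * op(A) * op(B) + beta * C,  op(X) = X or X^T.
// All matrices are row-major; float stands in for __fp16.
void hgemm_scaled(const float *A, const float *B, float *C,
                  size_t M, size_t N, size_t K,
                  float alpha, float beta, bool transA, bool transB) {
  for (size_t m = 0; m < M; ++m) {
    for (size_t n = 0; n < N; ++n) {
      float acc = 0.f;
      for (size_t k = 0; k < K; ++k) {
        // transA: A is stored (K, M), so element (m, k) lives at k*M + m.
        float a = transA ? A[k * M + m] : A[m * K + k];
        // transB: B is stored (N, K), so element (k, n) lives at n*K + k.
        float b = transB ? B[n * K + k] : B[k * N + n];
        acc += a * b;
      }
      C[m * N + n] = alpha * acc + beta * C[m * N + n];
    }
  }
}
```

In practice each of the four transpose cases would get its own packed kernel rather than per-element branching; this form only pins down the expected semantics.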

Step3. Advanced optimization

These are not strictly necessary, but we may eventually need:

  • fused HGEMM with activation
  • asm-based kernel
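To illustrate the fusion idea: applying the activation while the accumulator is still in registers saves a second full pass over C. A minimal sketch, assuming a ReLU activation and the same scalar stand-in as above (hypothetical `hgemm_relu`):

```cpp
#include <cstddef>
#include <algorithm>

// Fused GEMM + ReLU: the activation is applied to each accumulator
// right before the store, instead of launching a separate elementwise
// kernel that would re-read and re-write all of C.
void hgemm_relu(const float *A, const float *B, float *C,
                size_t M, size_t N, size_t K) {
  for (size_t m = 0; m < M; ++m) {
    for (size_t n = 0; n < N; ++n) {
      float acc = 0.f;
      for (size_t k = 0; k < K; ++k)
        acc += A[m * K + k] * B[k * N + n];
      C[m * N + n] = std::max(acc, 0.f); // fused activation
    }
  }
}
```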

3. Keep in mind that...

1. Concerns about precision

  • nvidia fp16 paper

    • Tensor Cores, evenly distributed across 80 multiprocessors. Each Tensor Core possesses a mixed-precision 4×4×4 matrix processing array which performs the operation D = A×B+C, where A, B, C and D are 4×4 matrices. The inputs A and B must be represented in FP16 format, while C and D can be represented in FP16 or in FP32 formats. It is also possible that C and D point to the same matrix.
  • hyperclova

  • gemmlowp

    • for uint16-to-uint32 GEMM, they use up to 16 * ACC24 (reason unclear)

2. Justification of optimal GEMM implementation

:octocat: cibot: Thank you for posting issue #2583. The person in charge will reply soon.

It might be better to refer to the PR number for each finished item.
I agree about Step 3. We can delay it when we have enough time.

Right, but for detailed progress updates, I am managing them in Projects/Half-Precision GEMM.
Furthermore, I will definitely mention this issue in every related PR.

Anyone who wants to discuss this issue further can reopen it.
Closing temporarily; it will be updated from time to time.