Some confusion about the GlobalToPrivateA2D kernel
lsl036 opened this issue · comments
Hello! I studied your GEMM tutorial before, and now I am learning the implementation of the CLBlast GEMM. Here are some points that confuse me in the `GlobalToPrivateA2D` kernel.
- The first thing is that I don't understand how the global A micro-tile is loaded into registers. The `apm` array holds `NWI*(KREG/VWN)` vectors, and the matrix A is M-by-K, but `a_index` is calculated with `kSizeK` as the leading dimension:

  ```c
  const int a_index = (tid_y * NWI + _ni) * kSizeK + idk + _ki * VWN;
  ```

  Is A transposed somewhere? (In the case of ColMajor and NoTrans.) And how is `kSizeM` divided evenly by `NWI`?
- As for the `GlobalToPrivateB2D` kernel, B is transposed, so here B is N-by-K, which makes the process of getting `bpm` much clearer. But still, is the `kSizeN` dimension divisible by `MWI`?
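To make the first question concrete, here is my own small Python sketch (not CLBlast source, with hypothetical sizes) that enumerates the global offsets the `a_index` formula touches for one thread:

```python
# Hypothetical tile parameters and matrix size, chosen so KREG/VWN is an integer.
kSizeK = 16
NWI = 8
VWN = 4
KREG = 4

def a_indices(tid_y, idk):
    """Starting offsets of the vector loads into apm for one (tid_y, idk) step,
    following a_index = (tid_y * NWI + _ni) * kSizeK + idk + _ki * VWN."""
    out = []
    for _ni in range(NWI):
        for _ki in range(KREG // VWN):
            out.append((tid_y * NWI + _ni) * kSizeK + idk + _ki * VWN)
    return out

print(a_indices(tid_y=0, idk=0))  # -> [0, 16, 32, 48, 64, 80, 96, 112]
```

Consecutive `_ni` values jump by a full `kSizeK` stride, which is only a sensible access pattern if K is the contiguous dimension in memory, i.e. if A has been transposed beforehand.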
Now suppose `VWM = VWN = 4`, `KREG = 4`, `KWI = 1`, `MWI = 4`, and `NWI = 8`. Both matrices are ColMajor and NoTrans. I did a simple analysis:
Am I understanding the `bpm` operation correctly? And how does A end up in `apm` through this kernel?
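As a sanity check on the numbers above, here is my own arithmetic for the per-thread register-tile sizes these parameters imply (the `bpm` size is my symmetric assumption, not taken from the CLBlast source):

```python
# Parameters from the example above.
VWM = VWN = 4
KREG = 4
MWI = 4
NWI = 8

# apm holds NWI * (KREG / VWN) vectors of width VWN (the A micro-tile).
apm_vectors = NWI * (KREG // VWN)
# Assumption: bpm symmetrically holds MWI * (KREG / VWM) vectors of width VWM.
bpm_vectors = MWI * (KREG // VWM)

print(apm_vectors, apm_vectors * VWN)  # -> 8 32 (8 vectors, 32 scalars)
print(bpm_vectors, bpm_vectors * VWM)  # -> 4 16 (4 vectors, 16 scalars)
```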
Why do these two kernels exchange the `MWI` and `NWI` of A and B?
Any explanations would be very helpful. Thank you very much!
Thank you for your interest, and apologies for the late reply - I'm a bit busy these days. I'll give a short answer with some pointers for now; I don't have time to go into details.
First of all, I assume you've looked at the paper first? That might give you some answers. The referenced paper by Matsumoto et al. might also help in explaining things.
> Is A transposed somewhere?
Yes indeed, A is pre-transposed, see these lines of code. CLBlast runs some pre-processing kernels so that matrices are always in the format the main kernel expects: there is only one 'indirect' GEMM kernel, and it assumes a transposed matrix A and non-transposed matrices B and C.
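Conceptually, that pre-transpose step turns a column-major M-by-K matrix A into a layout where K is the leading (contiguous) dimension, which is what `a_index` assumes. A minimal Python sketch of the idea (the real CLBlast kernels are OpenCL and also handle vector widths and padding):

```python
def pre_transpose(a_colmajor, M, K):
    """Move element (m, k) from offset m + k*M (column-major)
    to offset m*K + k (K as the leading dimension)."""
    out = [0] * (M * K)
    for k in range(K):
        for m in range(M):
            out[m * K + k] = a_colmajor[m + k * M]
    return out

# 2-by-3 example: column-major storage of [[1, 3, 5], [2, 4, 6]].
print(pre_transpose([1, 2, 3, 4, 5, 6], M=2, K=3))  # -> [1, 3, 5, 2, 4, 6]
```

After this step, each row of A (the K elements one thread walks through) is contiguous in memory.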
> But still, is the `kSizeN` dimension divisible by `MWI`?
This is also guaranteed by the pre-transpose kernels: they also pad with zeros if needed. That way we can guarantee all kinds of such properties, which makes the kernel code simpler and faster, at the cost of a few extra multiplications by zero. The tuners furthermore set all kinds of other constraints on these parameters, but of course that doesn't include the user-settable sizes such as `kSizeN`.
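For illustration, the padding guarantee boils down to rounding each user-supplied size up to the next multiple of the relevant tile parameter (my own helper, not CLBlast code):

```python
def pad_to_multiple(n, tile):
    """Round n up to the next multiple of tile; the extra rows or columns
    are filled with zeros by the pre-processing kernels."""
    return ((n + tile - 1) // tile) * tile

MWI = 4  # example tile parameter
print(pad_to_multiple(100, MWI))  # -> 100 (already divisible)
print(pad_to_multiple(101, MWI))  # -> 104
```

Inside the kernel, divisibility conditions like `kSizeN % MWI == 0` can then be assumed to hold unconditionally.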
Hope this helps already!
Yeah, I had read the paper before.
I now see that the pre-transpose kernels guarantee these properties.
Thank you for your detailed answer!