Some confusion about the GlobalToPrivateA2D kernel
lsl036 opened this issue · comments
Hello! I studied your GEMM tutorial before, and now I am learning the implementation of the CLBlast GEMM. Here are some points that confuse me in the `GlobalToPrivateA2D` kernel.
- The first thing is that I don't understand how the global A micro-tile is loaded into registers. The `apm` array holds `NWI*(KREG/VWN)` vectors, and the matrix A is M-by-K, but `a_index` is calculated with `kSizeK` as the leading dimension:

  ```c
  const int a_index = (tid_y * NWI + _ni) * kSizeK + idk + _ki * VWN;
  ```

  Is A transposed somewhere? (In the case of ColMajor and NoTrans.) And how is `kSizeM` divided evenly by `NWI`?
- As for the `GlobalToPrivateB2D` kernel, B is transposed, so here B is N-by-K, which makes the process of getting `bpm` much clearer. But still, is the `kSizeN` dimension divisible by `MWI`?
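To make the first question concrete, here is my own small Python sketch (not CLBlast source, with hypothetical sizes) that enumerates the global offsets the `a_index` formula touches for one thread:

```python
# Hypothetical tile parameters and matrix size, chosen so KREG/VWN is an integer.
kSizeK = 16
NWI = 8
VWN = 4
KREG = 4

def a_indices(tid_y, idk):
    """Starting offsets of the vector loads into apm for one (tid_y, idk) step,
    following a_index = (tid_y * NWI + _ni) * kSizeK + idk + _ki * VWN."""
    out = []
    for _ni in range(NWI):
        for _ki in range(KREG // VWN):
            out.append((tid_y * NWI + _ni) * kSizeK + idk + _ki * VWN)
    return out

print(a_indices(tid_y=0, idk=0))  # -> [0, 16, 32, 48, 64, 80, 96, 112]
```

Consecutive `_ni` values jump by a full `kSizeK` stride, which is only a sensible access pattern if K is the contiguous dimension in memory, i.e. if A has been transposed beforehand.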
Now suppose `VWM = VWN = 4`, `KREG = 4`, `KWI = 1`, `MWI = 4`, and `NWI = 8`. Both matrices are ColMajor and NoTrans. I did a simple analysis:
Am I understanding the `bpm` operation correctly? And how does A end up in `apm` through this kernel?
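As a sanity check on the numbers above, here is my own arithmetic for the per-thread register-tile sizes these parameters imply (the `bpm` size is my symmetric assumption, not taken from the CLBlast source):

```python
# Parameters from the example above.
VWM = VWN = 4
KREG = 4
MWI = 4
NWI = 8

# apm holds NWI * (KREG / VWN) vectors of width VWN (the A micro-tile).
apm_vectors = NWI * (KREG // VWN)
# Assumption: bpm symmetrically holds MWI * (KREG / VWM) vectors of width VWM.
bpm_vectors = MWI * (KREG // VWM)

print(apm_vectors, apm_vectors * VWN)  # -> 8 32 (8 vectors, 32 scalars)
print(bpm_vectors, bpm_vectors * VWM)  # -> 4 16 (4 vectors, 16 scalars)
```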
Why do these two kernels exchange the `MWI` and `NWI` of A and B?
Any explanations would be very helpful. Thank you very much!
Thank you for your interest, and apologies for the late reply - I'm a bit busy these days. I'll give a short answer with some pointers for now; I don't have time to go into details.
First of all, I assume you've looked at the paper first? That might give you some answers. The referenced paper by Matsumoto et al. might also help in explaining things.
> Is A transposed somewhere?
Yes indeed, A is pre-transposed, see these lines of code. CLBlast runs some pre-processing kernels so that matrices are always in the format the main kernel expects: there is only one 'indirect' GEMM kernel, and it assumes a transposed matrix A and non-transposed matrices B and C.
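Conceptually, that pre-transpose step turns a column-major M-by-K matrix A into a layout where K is the leading (contiguous) dimension, which is what `a_index` assumes. A minimal Python sketch of the idea (the real CLBlast kernels are OpenCL and also handle vector widths and padding):

```python
def pre_transpose(a_colmajor, M, K):
    """Move element (m, k) from offset m + k*M (column-major)
    to offset m*K + k (K as the leading dimension)."""
    out = [0] * (M * K)
    for k in range(K):
        for m in range(M):
            out[m * K + k] = a_colmajor[m + k * M]
    return out

# 2-by-3 example: column-major storage of [[1, 3, 5], [2, 4, 6]].
print(pre_transpose([1, 2, 3, 4, 5, 6], M=2, K=3))  # -> [1, 3, 5, 2, 4, 6]
```

After this step, each row of A (the K elements one thread walks through) is contiguous in memory.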
> But still, is the `kSizeN` dimension divisible by `MWI`?
This is also guaranteed by the pre-transpose kernels: they also pad with zeros if needed. That way we can guarantee all kinds of such properties, which makes the kernel code simpler and faster, at the cost of a few extra multiplications by zero. The tuners furthermore set all kinds of other constraints on these parameters, but of course that doesn't include the user-settable sizes such as `kSizeN`.
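For illustration, the padding guarantee boils down to rounding each user-supplied size up to the next multiple of the relevant tile parameter (my own helper, not CLBlast code):

```python
def pad_to_multiple(n, tile):
    """Round n up to the next multiple of tile; the extra rows or columns
    are filled with zeros by the pre-processing kernels."""
    return ((n + tile - 1) // tile) * tile

MWI = 4  # example tile parameter
print(pad_to_multiple(100, MWI))  # -> 100 (already divisible)
print(pad_to_multiple(101, MWI))  # -> 104
```

Inside the kernel, divisibility conditions like `kSizeN % MWI == 0` can then be assumed to hold unconditionally.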
Hope this helps already!
Yeah, I had read the paper before.
I now see that the pre-transpose kernels guarantee these properties.
Thank you for your detailed answer!