google / gemmlowp

I am studying the kernels implemented in gemmlowp, and noticed a possible discrepancy in the KernelFormat of the AVX2 kernel, i.e. here:

gemmlowp/internal/kernel_avx.h

Line 33 in 1490d29

KernelSideFormat<CellFormat<4, 2, CellOrder::WidthMajor>, 1>>

So if I understand CellFormat<4,2> correctly, width is 4 (corresponding the the number of columns on the RHS), and depth is 2 (rows in the RHS), meaning we have a 2x4 matrix. Further, `CellOrder::WidthMajor for the RHS implies column order storage, so consecutive increments should be placed in consecutive rows:

1 3 5 7
2 4 6 8

The comment in kernel_avx.h says:

A 2x8 cell of Rhs is stored in 16bit in ymm1 ,

So here is already one discrepancy: the format described by the template is 2x4, not 2x8.

We see the next discrepancy in the inline asm code in kernel_avx.h contains this:

gemmlowp/internal/kernel_avx.h

Line 113 in 1490d29

"vpmovzxbw (%[rhs_ptr]), %%ymm1 \n\t" // mov rhs to ymm1

The command vpmovzxbw corresponds to _mm256_cvtepu8_epi16 (__m128i a), which reads 16 bytes (16 * 8 = 128), and sign extends them to 16 i16 (16 * 16 = 256, the amount of bits fitting in a ymm* register). But the template only allows for 8 bytes! So my intuition would be that the comment should correct, given that it defines a matrix of 16 elements. Yet I don't understand how the code apparently seems to work if we only have 8 bytes for the RHS?

Finally, we have:

gemmlowp/internal/kernel_avx.h

Line 180 in 1490d29

"vpmovzxbw 0x08(%[rhs_ptr]), %%ymm1 \n\t" // mov rhs to ymm1

Here we jump to the 8th element in the kernel. But according to the template this should be outside of the defined region of RHS?

I tried to correct the template to work according to my expectations and according to the comments, i.e. I set KernelSideFormat<CellFormat<8, 2, CellOrder::WidthMajor>, 1>>, but that lead to the tests erroring.

Can somebody comment on my observation and rectify my misunderstanding of the code? I guess it seems to work, but I don't understand how. Specifically, how is the code not reading past the defined memory regions?

On the other hand, given that we have 16 ymm* registers on AVX machines, I also understand what the code itself tries to accomplish: by having 4 columns of on the RHS, we can broadcast two rows at a time to a packed vector, and combine it with one cell of the LHS. This way, we only need 4 registers to accumulate one cell block of the LHS with all of RHS, which only has one cell and will thus always stay in ymm1. This allows the entire operation to be had in the 16 ymm* registers without a single register spill.

I just don't understand how an AVX load works with 8 bytes only.

Come to think more about this, I realize that I was mislead by one of the asm instructions. I think it's not vpmovzxbw that is required, but rather vpmovzxbd. vpmovzxbd indeed loads 8 u8 into one 256 bit vector, zero extending them to 8 i32.

I am not completely clear yet how the interaction with all the other operations is, but at least this doesn't violate my assumption about which memory is being accessed (in particular, reading past the cell as defined in the template).

Not completely clear to me are also the “lower half” comments in above the loads of rhs_ptr. What about the upper half?

Pinging @rrwinterton who introduced the kernel: #122

I'll let @rrwinterton comment as I'm not familiar with this code.

Generic info: the general meaning of the kernel format concepts here is explained in
https://github.com/google/gemmlowp/blob/master/internal/kernel.h
and a reference kernel implementation for an arbitrary format is given in
https://github.com/google/gemmlowp/blob/master/internal/kernel_reference.h

I came to realize that I was trying to understand the kernel devoid of the packing mechanism. I now realize that that doesn't work. I have to wrap my head around how exactly the matrices are packed, but I am pretty sure the answer to my problem is there.

Closing for now, let us know if any new question comes up.

Thanks bjacob I didn't get the email of the previous questions in December. If there are any questions you may have after reviewing the packing mechanism and the concepts documents bjacob provided let me know. I will watch the emails more closely.
Rich

CellFormat in AVX2 kernel incorrect? Question for clarification