richgel999 / sserangecoding

Fast vectorized (SSE 4.1) range coder for 8-bit alphabets

cvttps and AVX2?

jan-wassenberg opened this issue

Howdy Rich, very nice :)
Have you considered _mm_cvttps_epi32? I believe that has the truncation built in, so no need to change the rounding mode.

Also, do you think it would help to use AVX2 and its gather instruction for the table lookups? That would avoid the extract+insert.

If you'd prefer to avoid writing two copies of the code, I'd be happy to help with porting to github.com/google/highway.

> Have you considered _mm_cvttps_epi32? I believe that has the truncation built in, so no need to change the rounding mode.

Thanks - that's a good idea. I overlooked that one. The only place that requires the rounding mode to be set to truncation is here, for the _mm_cvtps_epi32():
__m128i q = _mm_cvtps_epi32(_mm_div_ps(_mm_cvtepi32_ps(arith_value), _mm_cvtepi32_ps(r)));

I'll fix the code and give you credit for the suggestion. Thanks! (I almost gave up several times trying to get this to work and be efficient, so the rounding-mode change is an artifact of that trial-and-error process.)
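For reference, here's that same line with the truncating convert substituted in, which is what the fix amounts to - no rounding-mode changes needed around it:

__m128i q = _mm_cvttps_epi32(_mm_div_ps(_mm_cvtepi32_ps(arith_value), _mm_cvtepi32_ps(r)));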

I tried using AVX2 gathers - on Ice Lake they are slower than the extract/insert sequences. This seems to be the consensus after some searching. It should be possible to use AVX2 to do 32 streams, though.
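For context, a minimal sketch of the extract/insert lookup pattern being compared against the gather here (the function name and 32-bit table layout are illustrative, not the repo's exact code):

#include <smmintrin.h> // SSE 4.1
#include <stdint.h>

// Four independent 32-bit table lookups, one scalar load per lane.
static inline __m128i lookup4(const uint32_t* table, __m128i idx)
{
	__m128i r = _mm_cvtsi32_si128((int)table[(uint32_t)_mm_cvtsi128_si32(idx)]);
	r = _mm_insert_epi32(r, (int)table[(uint32_t)_mm_extract_epi32(idx, 1)], 1);
	r = _mm_insert_epi32(r, (int)table[(uint32_t)_mm_extract_epi32(idx, 2)], 2);
	r = _mm_insert_epi32(r, (int)table[(uint32_t)_mm_extract_epi32(idx, 3)], 3);
	return r;
}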

I've checked in this improvement.

> If you'd prefer to avoid writing two copies of the code, I'd be happy to help with porting to github.com/google/highway.

I'm not worried about 2 copies. Relatively little effort has been invested in vectorizing range coding, and this repo is just an example of what's possible.

I just checked in an efficiency fix to test.cpp. The 'c' command could generate files which were too large if the input file size was >65535 bytes. This didn't impact the lower-level codec itself or the test mode, and the resulting compressed file was still correct and decompressed correctly.

Nice, glad to see you incorporated cvttps :)

> I tried using AVX2 gathers - on Ice Lake they are slower than the extract/insert sequences.

Interesting, that's a surprise. One potential confounder is that clang, with AVX-512 build flags at least,
actually compiles the extract+insert into a gather! https://gcc.godbolt.org/z/5j4b3z64n

> It should be possible to use AVX2 to do 32 streams, though.

Yes, I figure that would be a sizeable win. 256-bit vdivps has the same latency as 128-bit on SKX, so we'd be bringing more hardware to bear.
(By contrast, the 512-bit div on SKX appears to be double-pumping that same hardware.)

Interleaved Huffman decoding using SSE 4.1 plus 2 key AVX2 instructions - around 1060 MiB/sec. on 8-bit alphabets (book1). Uses a 13-bit max code size and a 16KB table.

Unfortunately it needs _mm_srlv_epi32 and _mm_sllv_epi32. Emulating them with pure SSE4.1 will be costly and possibly not worth the effort vs. optimized scalar Huffman. Nearly everything about SSE/AVX fights to defeat you.


Worked around the problem by reversing the Huffman codes (JPEG-style, not zlib-style), so all the variable lane shifts are left shifts, never right shifts. On SSE 4.1 it gets ~830 MiB/sec., AVX2 ~1070 MiB/sec. This should also scale well from 16 to 32 lanes using AVX2 throughout.
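A minimal sketch of why the reversal helps, assuming 13-bit max codes and each lane's upcoming bits kept in the high bits of its buffer (names hypothetical; shift_left is the SSE 4.1 emulation shown further down the thread):

__m128i idx = _mm_srli_epi32(bit_buf, 32 - 13); // peek the next 13 bits: a CONSTANT right shift
// ... table lookup on idx yields the symbol and its code_len ...
bit_buf = shift_left(bit_buf, code_len);        // consume: the only variable shift is a LEFT shift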


I've added benchmarks against ryg_rans and Collet's FSE benchmark - they are quite interesting:
https://github.com/richgel999/sserangecoding/blob/main/README.md

ryg_rans gets the smallest file, but it's not performing as well as I would expect. On this machine (Ice Lake) interleaved range coding is faster.

Congrats on the speed win!

> Emulating them with pure SSE4.1 will be costly and possibly not worth the effort vs. optimized scalar Huffman. Nearly everything about SSE/AVX fights to defeat you.

Hah, indeed. Highway is a shield that protects us from most of the unpleasantness, filling in such gaps.
Here's our implementation of sllv in case it's useful: https://github.com/google/highway/blob/master/hwy/ops/x86_128-inl.h#L5597

Thanks - looks like the Highway code is very similar to what I'm doing to emulate sllv. I have examined both the range and Huffman decoders and I don't see any reason why they can't scale to 32 lanes using AVX2. That's next.

#define USE_AVX2 1

// Variable per-lane left shift: a << b for each 32-bit lane.
static __forceinline __m128i shift_left(__m128i a, __m128i b)
{
#if USE_AVX2
	return _mm_sllv_epi32(a, b);
#else
	// SSE 4.1 fallback: add b to the exponent field of 1.0f to build the float
	// 2^b, convert back to integer, then multiply (valid for b in [0, 30]).
	__m128 s = _mm_castsi128_ps(_mm_add_epi32(_mm_slli_epi32(b, 23), _mm_castps_si128(_mm_set1_ps(1.0f))));
	return _mm_mullo_epi32(a, _mm_cvttps_epi32(s));
#endif
}

Nice. 32 lanes/AVX2 sounds interesting, I'm curious how that goes.

Got pure SSE 4.1 Huffman decoding to 950 MiB/sec. SSE 4.1 plus one AVX2 instruction is at 1254 MiB/sec. I've got all the essentials in place to try AVX2 next. The fastest scalar Huffman decoders I'm aware of get around 600-800 MiB/sec.

I had to switch to gather (_mm256_i32gather_epi32) to get AVX2 Huffman decoding to scale. Otherwise it was stuck at 950 MiB/sec. Now it's ~1780 MiB/sec. This is probably necessary for range decoding too, because apart from the divs the inner loops are similar.
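A minimal sketch of the gather-based lookup, assuming 32-bit table entries (hence scale = 4) and the peeked indices already in a vector; names are illustrative:

#include <immintrin.h> // AVX2

// Eight table lookups in one instruction instead of eight extract/insert pairs.
static inline __m256i lookup8(const uint32_t* table, __m256i indices)
{
	return _mm256_i32gather_epi32((const int*)table, indices, 4);
}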


I'm now up to 1155 MiB/sec. decoding using purely SSE 4.1 (i.e. no more _mm_sllv_epi32). I switched the 13-bit Huffman lookup table to 32-bit entries and put the left-shift multipliers (2^code_size) into the high 16-bit word of each entry, which is faster than computing the shift multipliers with the FP unit.
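A sketch of consuming a code with the packed entries, assuming the layout described above (symbol/length in the low 16 bits, 1 << code_size in the high 16 bits) and the lookup4 helper sketched earlier; variable names are hypothetical:

__m128i entries = lookup4(huff_table, idx);        // 32-bit packed table entries
__m128i mults   = _mm_srli_epi32(entries, 16);     // high word: 2^code_size
bit_buf         = _mm_mullo_epi32(bit_buf, mults); // bit_buf <<= code_size per lane, no sllv needed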


Got SSE 4.1 Range Decoding on Ice Lake to 738 MiB/sec. by switching to 32 lanes and more efficient writing. Now attempting to port it to AVX2.

Fast scalar Huffman on the same file peaks out at around 840 MiB/sec.

First attempt at AVX2 Range Decoding gets 1008 MiB/sec. Not scaling as well as I would expect, but it's a start.


Bumping both the SSE 4.1 and AVX2 versions up to 64 lanes and further optimizing the AVX2 normalize routine for range decoding:

SSE 4.1: 733 MiB/sec.
AVX 2: 1203 MiB/sec.

So AVX2 is around 1.6x faster. I'm sure an experienced AVX2 coder could do better. I can't open source the AVX2 code yet, but it's a fairly straightforward port. The only tricky part is the normalization function - you can't expand the 256-entry shuffle table in there (which is used to distribute the proper number of source bytes into each value lane) to 64k entries. Instead I had to sample the shuffle table twice and combine the results, and shift the "length" variable left using _mm256_sllv_epi32 rather than shuffles.
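For illustration, one plausible shape of the "sample the shuffle table twice and combine" step, assuming a 256-entry table of 16-byte pshufb controls indexed by an 8-bit mask per 128-bit half (all names hypothetical - this is a guess, not the actual unreleased code):

__m128i lo_ctl  = shuffle_table[mask & 0xff];         // control for the low 128 bits
__m128i hi_ctl  = shuffle_table[(mask >> 8) & 0xff];  // control for the high 128 bits
__m256i control = _mm256_set_m128i(hi_ctl, lo_ctl);
__m256i bytes   = _mm256_shuffle_epi8(src, control);  // shuffles within each 128-bit lane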

Steady improvements, nice :)
Could be interesting to run the code through llvm-mca to guesstimate the bottlenecks.

Just optimized the AVX2 implementation more, further vectorizing the normalization. It's now 1.86x faster than SSE 4.1 on Ice Lake. I'm sure it can be pushed further with more tweaks.

Nice! That's quite a speedup, more than we often see from AVX2 (unless FMA is involved).
I'm curious whether you also plan to open source that?