richgel999 / sserangecoding

Fast vectorized (SSE 4.1) range coder for 8-bit alphabets

cvttps and AVX2?

jan-wassenberg opened this issue

Howdy Rich, very nice :)
Have you considered _mm_cvttps_epi32? I believe that has the truncation built in, so no need to change the rounding mode.

Also, do you think it would help to use AVX2 and its gather instruction for the table lookups? That would avoid the extract+insert.

If you'd prefer to avoid writing two copies of the code, I'd be happy to help with porting to github.com/google/highway.

> Have you considered _mm_cvttps_epi32? I believe that has the truncation built in, so no need to change the rounding mode.

Thanks - that's a good idea. I overlooked that one. The only place that requires the rounding mode to be set to truncation is here, for the _mm_cvtps_epi32():
__m128i q = _mm_cvtps_epi32(_mm_div_ps(_mm_cvtepi32_ps(arith_value), _mm_cvtepi32_ps(r)));

I'll fix the code and give you credit for the suggestion. Thanks! (I almost gave up several times trying to get this to work and be efficient, so the rounding-mode change is an artifact of that trial-and-error process.)
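For reference, here's that same line with the truncating convert substituted in, which is what the fix amounts to - no rounding-mode changes needed around it:

__m128i q = _mm_cvttps_epi32(_mm_div_ps(_mm_cvtepi32_ps(arith_value), _mm_cvtepi32_ps(r)));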

I tried using AVX2 gathers - on Ice Lake they are slower than the extract/insert sequences. This seems to be the consensus after some searching. It should be possible to use AVX2 to do 32 streams, though.
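For context, a minimal sketch of the extract/insert lookup pattern being compared against the gather here (the function name and 32-bit table layout are illustrative, not the repo's exact code):

#include <smmintrin.h> // SSE 4.1
#include <stdint.h>

// Four independent 32-bit table lookups, one scalar load per lane.
static inline __m128i lookup4(const uint32_t* table, __m128i idx)
{
	__m128i r = _mm_cvtsi32_si128((int)table[(uint32_t)_mm_cvtsi128_si32(idx)]);
	r = _mm_insert_epi32(r, (int)table[(uint32_t)_mm_extract_epi32(idx, 1)], 1);
	r = _mm_insert_epi32(r, (int)table[(uint32_t)_mm_extract_epi32(idx, 2)], 2);
	r = _mm_insert_epi32(r, (int)table[(uint32_t)_mm_extract_epi32(idx, 3)], 3);
	return r;
}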

I've checked in this improvement.

> If you'd prefer to avoid writing two copies of the code, I'd be happy to help with porting to github.com/google/highway.

I'm not worried about 2 copies. Relatively little effort has been invested in vectorizing range coding, and this repo is just an example of what's possible.

I just checked in an efficiency fix to test.cpp. The 'c' command could generate files which were too large if the input file size was >65535 bytes. This didn't impact the lower-level codec itself or the test mode, and the resulting compressed file was still correct and decompressed correctly.

Nice, glad to see you incorporated cvttps :)

> I tried using AVX2 gathers - on Ice Lake they are slower than the extract/insert sequences.

Interesting, that's a surprise. One potential confounder is that clang, with AVX-512 build flags at least,
actually compiles the extract+insert into a gather! https://gcc.godbolt.org/z/5j4b3z64n

> It should be possible to use AVX2 to do 32 streams, though.

Yes, I figure that would be a sizeable win. 256-bit vdivps has the same latency as 128-bit on SKX, so we'd be bringing more hardware to bear.
(By contrast, the 512-bit div on SKX appears to be double-pumping that same hardware.)

Interleaved Huffman decoding using SSE 4.1 plus 2 key AVX2 instructions - around 1060 MiB/sec. on 8-bit alphabets (book1). Uses a 13-bit max code size and a 16KB table.

Unfortunately it needs _mm_srlv_epi32 and _mm_sllv_epi32. Emulating them with pure SSE4.1 will be costly and possibly not worth the effort vs. optimized scalar Huffman. Nearly everything about SSE/AVX fights to defeat you.


Worked around the problem by reversing the Huffman codes (JPEG-style, not zlib-style), so all the variable lane shifts are left shifts, never right shifts. On SSE 4.1 it gets ~830 MiB/sec., AVX2 ~1070 MiB/sec. This should also scale well from 16 to 32 lanes using AVX2 throughout.
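A minimal sketch of why the reversal helps, assuming 13-bit max codes and each lane's upcoming bits kept in the high bits of its buffer (names hypothetical; shift_left is the SSE 4.1 emulation shown further down the thread):

__m128i idx = _mm_srli_epi32(bit_buf, 32 - 13); // peek the next 13 bits: a CONSTANT right shift
// ... table lookup on idx yields the symbol and its code_len ...
bit_buf = shift_left(bit_buf, code_len);        // consume: the only variable shift is a LEFT shift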


I've added benchmarks against ryg_rans and Collet's FSE benchmark - they are quite interesting:
https://github.com/richgel999/sserangecoding/blob/main/README.md

ryg_rans gets the smallest file, but it's not performing as well as I would expect. On this machine (Ice Lake) interleaved range coding is faster.

Congrats on the speed win!

> Emulating them with pure SSE4.1 will be costly and possibly not worth the effort vs. optimized scalar Huffman. Nearly everything about SSE/AVX fights to defeat you.

Hah, indeed. Highway is a shield that protects us from most of the unpleasantness, filling in such gaps.
Here's our implementation of sllv in case it's useful: https://github.com/google/highway/blob/master/hwy/ops/x86_128-inl.h#L5597

Thanks - looks like the Highway code is very similar to what I'm doing to emulate sllv. I have examined both the range and Huffman decoders and I don't see any reason why they can't scale to 32 lanes using AVX2. That's next.

#define USE_AVX2 1

// Variable per-lane left shift: a << b for each 32-bit lane.
static __forceinline __m128i shift_left(__m128i a, __m128i b)
{
#if USE_AVX2
	return _mm_sllv_epi32(a, b);
#else
	// SSE 4.1 fallback: add b to the exponent field of 1.0f to build the float
	// 2^b, convert back to integer, then multiply (valid for b in [0, 30]).
	__m128 s = _mm_castsi128_ps(_mm_add_epi32(_mm_slli_epi32(b, 23), _mm_castps_si128(_mm_set1_ps(1.0f))));
	return _mm_mullo_epi32(a, _mm_cvttps_epi32(s));
#endif
}

Nice. 32 lanes/AVX2 sounds interesting, I'm curious how that goes.

Got pure SSE 4.1 Huffman decoding to 950 MiB/sec. SSE 4.1 plus one AVX2 instruction is at 1254 MiB/sec. I've got all the essentials in place to try AVX2 next. The fastest scalar Huffman decoders I'm aware of get around 600-800 MiB/sec.

I had to switch to gather (_mm256_i32gather_epi32) to get AVX2 Huffman decoding to scale. Otherwise it was stuck at 950 MiB/sec. Now it's ~1780 MiB/sec. This is probably necessary for range decoding too, because apart from the divs the inner loops are similar.
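A minimal sketch of the gather-based lookup, assuming 32-bit table entries (hence scale = 4) and the peeked indices already in a vector; names are illustrative:

#include <immintrin.h> // AVX2

// Eight table lookups in one instruction instead of eight extract/insert pairs.
static inline __m256i lookup8(const uint32_t* table, __m256i indices)
{
	return _mm256_i32gather_epi32((const int*)table, indices, 4);
}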


I'm now up to 1155 MiB/sec. decoding using purely SSE 4.1 (i.e. no more _mm_sllv_epi32). I switched the 13-bit Huffman lookup table to 32-bit entries and put the left-shift multipliers (2^code_size) into the high 16-bit word of each entry, which is faster than computing the shift multipliers with the FP unit.
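A sketch of consuming a code with the packed entries, assuming the layout described above (symbol/length in the low 16 bits, 1 << code_size in the high 16 bits) and the lookup4 helper sketched earlier; variable names are hypothetical:

__m128i entries = lookup4(huff_table, idx);        // 32-bit packed table entries
__m128i mults   = _mm_srli_epi32(entries, 16);     // high word: 2^code_size
bit_buf         = _mm_mullo_epi32(bit_buf, mults); // bit_buf <<= code_size per lane, no sllv needed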


Got SSE 4.1 Range Decoding on Ice Lake to 738 MiB/sec. by switching to 32 lanes and more efficient writing. Now attempting to port it to AVX2.

Fast scalar Huffman on the same file peaks out at around 840 MiB/sec.

First attempt at AVX2 Range Decoding gets 1008 MiB/sec. Not scaling as well as I would expect, but it's a start.


Bumping both the SSE 4.1 and AVX2 versions up to 64 lanes and further optimizing the AVX2 normalize routine for range decoding:

SSE 4.1: 733 MiB/sec.
AVX 2: 1203 MiB/sec.

So AVX2 is around 1.6x faster. I'm sure an experienced AVX2 coder could do better. I can't open source the AVX2 code yet, but it's a fairly straightforward port. The only tricky part is the normalization function - you can't expand the 256-entry shuffle table in there (which is used to distribute the proper number of source bytes into each value lane) to 64k entries. Instead I had to sample the shuffle table twice and combine the results, and shift the "length" variable left using _mm256_sllv_epi32 rather than shuffles.
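For illustration, one plausible shape of the "sample the shuffle table twice and combine" step, assuming a 256-entry table of 16-byte pshufb controls indexed by an 8-bit mask per 128-bit half (all names hypothetical - this is a guess, not the actual unreleased code):

__m128i lo_ctl  = shuffle_table[mask & 0xff];         // control for the low 128 bits
__m128i hi_ctl  = shuffle_table[(mask >> 8) & 0xff];  // control for the high 128 bits
__m256i control = _mm256_set_m128i(hi_ctl, lo_ctl);
__m256i bytes   = _mm256_shuffle_epi8(src, control);  // shuffles within each 128-bit lane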

Steady improvements, nice :)
Could be interesting to run the code through llvm-mca to guesstimate the bottlenecks.

Just optimized the AVX2 implementation more, further vectorizing the normalization. It's now 1.86x faster than SSE 4.1 on Ice Lake. I'm sure it can be pushed further with more tweaks.

Nice! That's quite a speedup, more than we often see from AVX2 (unless FMA is involved).
I'm curious whether you also plan to open source that?