Vectorize base32 decoding

Question

Vectorize base32 decoding

k0ekk0ek opened this issue a year ago · comments

Base32 encoding is used in NSEC3 records (RFC5155) to hash the next owner name and may appear quite a lot in DNSSEC signed zones (not the case for .se, but is the case for e.g. .com). Like base16 and base64, this encoding may be eligible for vectorization too.

Jeroen Koekkoek · Answer 1 · Tue Jun 20 2023 17:27:21 GMT+0800 (China Standard Time)

Wojciech Muła and Daniel Lemire wrote the paper "Faster Base64 Encoding and Decoding Using AVX2 Instructions". Some of the logic can be used to decode Base32 too. A first stab may look something like this:

#define s0to9 (-128 + '0')
#define sAtoV (-128 + 'A' + 9)
#define satov (-128 + 'a' + 9)

#define r0to9 (-128 + ('9' - '0' + 1))
#define rAtoV (-128 + ('V' - 'A' + (1 + 9)))
#define ratov (-128 + ('v' - 'a' + (1 + 9)))

static void decode(const char *encoded)
{
  __m256i input = _mm256_loadu_si256((const __m256i*)encoded);

  __m256i subtract = setr_epi8(
    0, 0, 0, s0to9, sAtoV, sAtoV, satov, satov,
    0, 0, 0, 0, 0, 0, 0, 0, 0);

  __m256i compare = setr_epi8(
    -128, -128, -128, r0to9, sAtoV, sAtoV, satov, satov,
    -128, -128, -128, -128, -128, -128, -128, -128);

  // determine shuffle mask based on hi nibbles
  __m256i nibbles = _mm256_srli_epi32(input, 4);
  nibbles = _mm256_and_si256(nibbles, _mm256_set1_epi8(0x0f));

  subtract = _mm256_shuffle_epi8(subtract, nibbles);
  compare = _mm256_shuffle_epi8(compare, nibbles);

  __m256i d = _mm256_sub_epi8(input, subtract); 
  const __m256i mask = _mm_cmplt_epi8(digits, compare); 

  d = _mm256_and_si256(d, mask);

  // (pretend the above works flawlessly)

  // left shift can be expressed as multiply by power of two
  //  
  // SHL    MUL(16)    MUL(10)
  //  0  :  0x0001  :     1
  //  1  :  0x0002  :     2
  //  2  :  0x0004  :     4
  //  3  :  0x0008  :     8
  //  4  :  0x0010  :    16
  //  5  :  0x0020  :    32
  //  6  :  0x0040  :    64
  //  7  :  0x0080  :   128
  //  8  :  0x0100  :   256
  //  9  :  0x0200  :   512
  // 10  :  0x0400  :  1024

  // pack into 10-bit words
  const __m128i d16 = _mm_addubs_epi16(d, _mm_set1_epi32(0x01080108));

  // pack into 20-bit words
  const __m128i d32 = _mm_madd_epi16(d16, _mm_set1_epi32(0x00010400));

  //  
  // base64 decoder now uses a shuffle, but that is not possible for base32 as
  // byte boundaries span epi32 (only 20-bit words)
  //  
  // shifting individual epi32 into location (_mm_slli_epi64) and adding them
  // afterward will get two 40-bit words, which means proper byte boundaries.
  //  

  // implement
}

However, with Base32 after packing up to epi32, we have 20-bit words. As such, the shuffle operation cannot be done right away because of the byte boundary (we miss 4 bits). Of course, this can be resolved by shifting and then adding two epi32 together which gets us 2 40-bit words, where the byte boundary is convenient again.

Daniel Lemire · Answer 2 · Tue Jun 20 2023 22:39:45 GMT+0800 (China Standard Time)

Same question here: are spaces allowed within the codes (e.g., line returns).

Jeroen Koekkoek · Answer 3 · Tue Jun 20 2023 22:56:36 GMT+0800 (China Standard Time)

No, Base32 is only used in NSEC3 (presentation format). The data there is not the last field and hence must be presented as one contiguous set of characters.

Example from Appendix A in RFC5155:

  NSEC3   1 1 12 aabbccdd (
                            2vptu5timamqttgl4luu9kg21e0aor3s A RRSIG )

aqrit · Answer 4 · Sat Jun 24 2023 19:41:22 GMT+0800 (China Standard Time)

pretend the above works flawlessly

The input range can be hashed/sliced into spans of 8 instead spans of 16.
However, that necessitates a fixup for signed chars.

const __m128i delta_check = _mm_setr_epi8(-128, -128, -128, -128, -128, -128, -50, -128, -65, 0, 0, 0x7F - 0x5A, -97, 0, 0, 0x7F - 0x7A);
const __m128i delta_rebase = _mm_setr_epi8(0, 0, 0, 0, 0, 0, -24, 0, -65, -65, -65, -65, -97, -97, -97, -97);

__m128i v = _mm_loadu_si128((__m128i*)src);
__m128i hash_key = _mm_and_si128(_mm_srli_epi32(v, 3), _mm_set1_epi8(0x0F));

// unpacked binary-coded-duotrigesimal
__m128i data = _mm_add_epi8(_mm_shuffle_epi8(delta_rebase, hash_key), v);

// bad char check
__m128i check = _mm_add_epi8(_mm_shuffle_epi8(delta_check, hash_key), v);
check = _mm_or_si128(check, v); // fixup signed chars
if(_mm_movemask_epi8(check)) {
	; // bad char detected
}

// todo: pack bits

edit: it looks like "base32hex" is required but the same approach would still apply.
Though, maybe a better job can be done with the different characters.

Daniel Lemire · Answer 5 · Mon Jun 26 2023 00:22:59 GMT+0800 (China Standard Time)

I will have a look this week.

aqrit · Answer 6 · Fri Jul 14 2023 06:56:48 GMT+0800 (China Standard Time)

Should we know how many base32hex digits to expect before starting decoding?

Anyways, decoding seems fairly uninteresting.
Here is a bare kernel:
https://gist.github.com/aqrit/ee41f2fe222b3cb68aaf7facd5cd08a6

Jeroen Koekkoek · Answer 7 · Fri Jul 14 2023 16:22:41 GMT+0800 (China Standard Time)

Thank you @aqrit! Looks pretty straightforward indeed. I'll play with it after the weekend and go from there. Much appreciated! @lemire, are you interested in reviewing and benchmarking too? The current code is terrible, so that might make for a fun write-up as well? 🙂

Daniel Lemire · Answer 8 · Fri Jul 14 2023 21:20:27 GMT+0800 (China Standard Time)

Yes. It is on my todo as well.

Jeroen Koekkoek · Answer 9 · Wed Jul 19 2023 23:59:12 GMT+0800 (China Standard Time)

@aqrit, sorry, totally forgot to answer your question. We can know the base32hex digits beforehand. Currently, it's only used for NSEC3 (RFC5155) which means the encoded string is a fixed size hash. However, new algorithms may be added, so I don't know if it's wise to do anything with it beforehand. I think it's better to just decode and require the proper amount of padding. Then verify it's the correct length for known algorithms afterwards?

Starting to look at your code now btw.

Jeroen Koekkoek · Answer 10 · Thu Jul 20 2023 01:19:33 GMT+0800 (China Standard Time)

Only looked at the simd variant (yet), but nice work @aqrit! Should make for a nice speedup (.com zone file contains it). Only thing we have to add is checking for correct padding, which is easy to add (especially because we don't have to account for whitespace). Thanks!

I especially like your range check, pretty clever.

(of course, I'll look at the swar version for the fallback parser too)

Daniel Lemire · Answer 11 · Thu Jul 20 2023 02:02:25 GMT+0800 (China Standard Time)

Ok. Let me build this up a big and see what we can make out of it... :-)

Daniel Lemire · Answer 12 · Thu Jul 20 2023 02:40:02 GMT+0800 (China Standard Time)

I hope to have something later today.

Daniel Lemire · Answer 13 · Thu Jul 20 2023 08:58:36 GMT+0800 (China Standard Time)

Ok. So in my tests, SWAR is effectively useless. The scalar approach is faster. I have even have a slightly faster one (base32hex_fast) that can reach 2.5 GB/s (about 10% faster than the routine/scalar approach). The SIMD approach is about 2.5 times faster in my tests. You can gain a bit of speed by using 256-bit vectors (AVX) but not that much. All the functions can slightly overwrite to the output buffer and overread in the input buffer.

For my benchmark, I use short inputs ("F1S6QOJADHQMKS3GCLIN4RB9F1Q6UT37") and there is no inlining, which means that each call has to pay for register initialization and all that, which explains the lack of power of the wide SIMD approach.

base32hex_avx                  :   7.55 GB/s  235.9 Ma/s   4.24 ns/d   3.23 GHz  13.70 c/d  61.21 i/d    0.4 c/b   1.91 i/b   4.47 i/c 
base32hex_simd                 :   7.02 GB/s  219.3 Ma/s   4.56 ns/d   3.23 GHz  14.72 c/d  70.21 i/d    0.5 c/b   2.19 i/b   4.77 i/c 
base32hex_fast                 :   2.53 GB/s   79.0 Ma/s  12.65 ns/d   3.21 GHz  40.57 c/d  194.21 i/d    1.3 c/b   6.07 i/b   4.79 i/c 
base32hex_simple               :   2.35 GB/s   73.4 Ma/s  13.62 ns/d   3.21 GHz  43.69 c/d  231.21 i/d    1.4 c/b   7.23 i/b   5.29 i/c 
base32hex_swar                 :   1.98 GB/s   61.7 Ma/s  16.20 ns/d   3.20 GHz  51.85 c/d  230.21 i/d    1.6 c/b   7.19 i/b   4.44 i/c

As before, I have made my code available...
https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2023/07/19

They are all maxing out the "instructions per cycle". So it is an instance where lowering the number of instructions is critical. Sadly, this might be a tad difficult for short inputs.

If you know that the size of the input (@aqrit raised this point), you can gain a bit of speed. I suspect it is not that much...

Jeroen Koekkoek · Answer 14 · Thu Jul 20 2023 13:59:06 GMT+0800 (China Standard Time)

Good stuff @lemire. Good to have this documented, don't know if this is used in many other places, but people can at least find it on the internet now.

Can you include this version too? It doesn't make sense from a performance perspective, but it nicely shows where we started from(?)

We also have to check for padding(?) We can use the zero_mask trick but with equals signs instead. Modulo with 8, pick the right mask and compare?

Daniel Lemire · Answer 15 · Thu Jul 20 2023 23:30:01 GMT+0800 (China Standard Time)

We also have to check for padding(?)

I followed @aqrit's idea and we don't really validate the padding. Basically, you can stop the stream and put any garbage (not just =) and it will work. (It should work.)

But sure, we can try to check that we have =, as you describe.

aqrit · Answer 16 · Fri Jul 21 2023 03:56:44 GMT+0800 (China Standard Time)

Base32: the "Base 32 Encoding with Extended Hex Alphabet" as
specified in [RFC4648]. Note that trailing padding characters
("=") are not used in the NSEC3 specification.

The Next Hashed Owner Name field is represented as an unpadded
sequence of case-insensitive base32 digits, without whitespace.

Is support for padding needed?

base32hex_fast

IMO, it is not worth pulling that much data into the cache.
However, the data in the table could be bswapped to eliminate the bswap in the code.
Bad_char entries could be a single set bit in the hi_byte, then only one branch for bad chars would be required..
though we'd still need to zero the bytes after the first bad char... because some bytes are made from 3 input chars.

Daniel Lemire · Answer 17 · Fri Jul 21 2023 04:32:46 GMT+0800 (China Standard Time)

@aqrit

IMO, it is not worth pulling that much data into the cache.

I agree. This is research.

Daniel Lemire · Answer 18 · Fri Jul 21 2023 04:38:33 GMT+0800 (China Standard Time)

I started implementing the padding check, and almost got it done, but after reading @aqrit 's comment, I am pulling out.

Because the functions return the number of bytes read, it is always easy to check the padding if you want to. All you need is a single loop when the number of bytes read is not a multiple of eight.

Daniel Lemire · Answer 19 · Fri Jul 21 2023 05:37:17 GMT+0800 (China Standard Time)

@aqrit I have implemented your optimizations but it hardly makes a difference performance-wise. :-)

However the code is prettier.

Daniel Lemire · Answer 20 · Fri Jul 21 2023 05:41:52 GMT+0800 (China Standard Time)

Ok. Here is what I get...

base32hex_avx                  :   8.02 GB/s  250.6 Ma/s   3.99 ns/d   3.24 GHz  12.91 c/d  61.21 i/d    0.4 c/b   1.91 i/b   4.74 i/c 
base32hex_simd                 :   6.95 GB/s  217.2 Ma/s   4.60 ns/d   3.23 GHz  14.86 c/d  70.21 i/d    0.5 c/b   2.19 i/b   4.72 i/c 
base32hex_fast                 :   2.54 GB/s   79.4 Ma/s  12.60 ns/d   3.21 GHz  40.39 c/d  119.21 i/d    1.3 c/b   3.73 i/b   2.95 i/c 
base32hex_simple               :   2.34 GB/s   73.0 Ma/s  13.70 ns/d   3.20 GHz  43.90 c/d  231.21 i/d    1.4 c/b   7.23 i/b   5.27 i/c 
base32hex_swar                 :   1.99 GB/s   62.2 Ma/s  16.08 ns/d   3.20 GHz  51.51 c/d  230.21 i/d    1.6 c/b   7.19 i/b   4.47 i/c 
b32_pton                       :   0.21 GB/s    6.5 Ma/s  153.07 ns/d   3.19 GHz  488.84 c/d  1429.90 i/d   15.3 c/b  44.68 i/b   2.93 i/c

I am moving the URL to Code-used-on-Daniel-Lemire-s-blog/2023/07/20 from Code-used-on-Daniel-Lemire-s-blog/2023/07/20 and pushing a blog post.

Daniel Lemire · Answer 21 · Fri Jul 21 2023 08:15:19 GMT+0800 (China Standard Time)

Blog post:

Fast decoding of base32 strings

Jeroen Koekkoek · Answer 22 · Fri Jul 21 2023 16:25:24 GMT+0800 (China Standard Time)

I started implementing the padding check, and almost got it done, but after reading @aqrit 's comment, I am pulling out.

Because the functions return the number of bytes read, it is always easy to check the padding if you want to. All you need is a single loop when the number of bytes read is not a multiple of eight.

Apparently, we don't even want to support padding 😅. RFC5155 states:

The Next Hashed Owner Name field is represented as an unpadded sequence of case-insensitive base32 digits, without whitespace.

And indeed, the .com zones I have laying around contain no padding. So in the PR, checking if the field is followed by a character not in the contiguous set is sufficient.

Jeroen Koekkoek · Answer 23 · Fri Jul 21 2023 16:26:07 GMT+0800 (China Standard Time)

That's some great numbers @lemire! Thank you both!

Daniel Lemire · Answer 24 · Sat Jul 22 2023 01:45:49 GMT+0800 (China Standard Time)

The current code would work even if there is padding. It would simply be lenient about it (e.g., it would accept any character as padding, not just '=').

However, I should note that adding such a check (that the padding is done with '=') would be cheap... I considered adding it as an option (check_padding=true/false), but why make the function more complicated than it needs to be?

Jeroen Koekkoek · Answer 25 · Sat Jul 22 2023 17:02:03 GMT+0800 (China Standard Time)

The code in NSD doesn't account for anything but valid characters and that code has been in use for a long time. I think it's better not to allow for padding and strictly follow the RFC. If there's a valid use case we can always add it afterwards(?)

Daniel Lemire · Answer 26 · Mon Jul 24 2023 02:14:07 GMT+0800 (China Standard Time)

@k0ekk0ek I think that's what the code we have written does. What I mean by lenient is that as soon as invalid characters are found, you consider that you have terminated the base32 sequence.

Of course, you could have something like....

F1S6QOJADHQMKS3GCLIN4RB9F1Q6UT37+8***fds

What is would do is that it would stop at the +.

Jeroen Koekkoek · Answer 27 · Mon Jul 24 2023 15:07:02 GMT+0800 (China Standard Time)

Sounds good to me 👍

Jeroen Koekkoek · Answer 28 · Tue Jul 25 2023 19:55:09 GMT+0800 (China Standard Time)

@lemire, are you working on a PR? No pressure, just want to know if you want to do the honors or if you want me to integrate the changes 🙂

Daniel Lemire · Answer 29 · Wed Jul 26 2023 00:27:41 GMT+0800 (China Standard Time)

I can prepare a PR, yes.

Daniel Lemire · Answer 30 · Wed Jul 26 2023 08:36:58 GMT+0800 (China Standard Time)

PR available.