NLnetLabs / simdzone

Fast and standards compliant DNS zone parser

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Vectorize base32 decoding

k0ekk0ek opened this issue · comments

Base32 encoding is used in NSEC3 records (RFC5155) to hash the next owner name and may appear quite a lot in DNSSEC signed zones (not the case for .se, but is the case for e.g. .com). Like base16 and base64, this encoding may be eligible for vectorization too.

Wojciech Muła and Daniel Lemire wrote the paper "Faster Base64 Encoding and Decoding Using AVX2 Instructions". Some of the logic can be used to decode Base32 too. A first stab may look something like this:

#define s0to9 (-128 + '0')
#define sAtoV (-128 + 'A' + 9)
#define satov (-128 + 'a' + 9)

#define r0to9 (-128 + ('9' - '0' + 1))
#define rAtoV (-128 + ('V' - 'A' + (1 + 9)))
#define ratov (-128 + ('v' - 'a' + (1 + 9)))

static void decode(const char *encoded)
{
  __m256i input = _mm256_loadu_si256((const __m256i*)encoded);

  __m256i subtract = setr_epi8(
    0, 0, 0, s0to9, sAtoV, sAtoV, satov, satov,
    0, 0, 0, 0, 0, 0, 0, 0, 0);

  __m256i compare = setr_epi8(
    -128, -128, -128, r0to9, sAtoV, sAtoV, satov, satov,
    -128, -128, -128, -128, -128, -128, -128, -128);

  // determine shuffle mask based on hi nibbles
  __m256i nibbles = _mm256_srli_epi32(input, 4);
  nibbles = _mm256_and_si256(nibbles, _mm256_set1_epi8(0x0f));

  subtract = _mm256_shuffle_epi8(subtract, nibbles);
  compare = _mm256_shuffle_epi8(compare, nibbles);

  __m256i d = _mm256_sub_epi8(input, subtract); 
  const __m256i mask = _mm_cmplt_epi8(digits, compare); 

  d = _mm256_and_si256(d, mask);

  // (pretend the above works flawlessly)

  // left shift can be expressed as multiply by power of two
  //  
  // SHL    MUL(16)    MUL(10)
  //  0  :  0x0001  :     1
  //  1  :  0x0002  :     2
  //  2  :  0x0004  :     4
  //  3  :  0x0008  :     8
  //  4  :  0x0010  :    16
  //  5  :  0x0020  :    32
  //  6  :  0x0040  :    64
  //  7  :  0x0080  :   128
  //  8  :  0x0100  :   256
  //  9  :  0x0200  :   512
  // 10  :  0x0400  :  1024

  // pack into 10-bit words
  const __m128i d16 = _mm_addubs_epi16(d, _mm_set1_epi32(0x01080108));

  // pack into 20-bit words
  const __m128i d32 = _mm_madd_epi16(d16, _mm_set1_epi32(0x00010400));

  //  
  // base64 decoder now uses a shuffle, but that is not possible for base32 as
  // byte boundaries span epi32 (only 20-bit words)
  //  
  // shifting individual epi32 into location (_mm_slli_epi64) and adding them
  // afterward will get two 40-bit words, which means proper byte boundaries.
  //  

  // implement
}

However, with Base32 after packing up to epi32, we have 20-bit words. As such, the shuffle operation cannot be done right away because of the byte boundary (we miss 4 bits). Of course, this can be resolved by shifting and then adding two epi32 together which gets us 2 40-bit words, where the byte boundary is convenient again.

Same question here: are spaces allowed within the codes (e.g., line returns).

No, Base32 is only used in NSEC3 (presentation format). The data there is not the last field and hence must be presented as one contiguous set of characters.

Example from Appendix A in RFC5155:

  NSEC3   1 1 12 aabbccdd (
                            2vptu5timamqttgl4luu9kg21e0aor3s A RRSIG )
commented

pretend the above works flawlessly

The input range can be hashed/sliced into spans of 8 instead spans of 16.
However, that necessitates a fixup for signed chars.

const __m128i delta_check = _mm_setr_epi8(-128, -128, -128, -128, -128, -128, -50, -128, -65, 0, 0, 0x7F - 0x5A, -97, 0, 0, 0x7F - 0x7A);
const __m128i delta_rebase = _mm_setr_epi8(0, 0, 0, 0, 0, 0, -24, 0, -65, -65, -65, -65, -97, -97, -97, -97);

__m128i v = _mm_loadu_si128((__m128i*)src);
__m128i hash_key = _mm_and_si128(_mm_srli_epi32(v, 3), _mm_set1_epi8(0x0F));

// unpacked binary-coded-duotrigesimal
__m128i data = _mm_add_epi8(_mm_shuffle_epi8(delta_rebase, hash_key), v);

// bad char check
__m128i check = _mm_add_epi8(_mm_shuffle_epi8(delta_check, hash_key), v);
check = _mm_or_si128(check, v); // fixup signed chars
if(_mm_movemask_epi8(check)) {
	; // bad char detected
}

// todo: pack bits

edit: it looks like "base32hex" is required but the same approach would still apply.
Though, maybe a better job can be done with the different characters.

I will have a look this week.

commented

Should we know how many base32hex digits to expect before starting decoding?

Anyways, decoding seems fairly uninteresting.
Here is a bare kernel:
https://gist.github.com/aqrit/ee41f2fe222b3cb68aaf7facd5cd08a6

Thank you @aqrit! Looks pretty straightforward indeed. I'll play with it after the weekend and go from there. Much appreciated! @lemire, are you interested in reviewing and benchmarking too? The current code is terrible, so that might make for a fun write-up as well? 🙂

Yes. It is on my todo as well.

@aqrit, sorry, totally forgot to answer your question. We can know the base32hex digits beforehand. Currently, it's only used for NSEC3 (RFC5155) which means the encoded string is a fixed size hash. However, new algorithms may be added, so I don't know if it's wise to do anything with it beforehand. I think it's better to just decode and require the proper amount of padding. Then verify it's the correct length for known algorithms afterwards?

Starting to look at your code now btw.

Only looked at the simd variant (yet), but nice work @aqrit! Should make for a nice speedup (.com zone file contains it). Only thing we have to add is checking for correct padding, which is easy to add (especially because we don't have to account for whitespace). Thanks!

I especially like your range check, pretty clever.

(of course, I'll look at the swar version for the fallback parser too)

Ok. Let me build this up a big and see what we can make out of it... :-)

I hope to have something later today.

Ok. So in my tests, SWAR is effectively useless. The scalar approach is faster. I have even have a slightly faster one (base32hex_fast) that can reach 2.5 GB/s (about 10% faster than the routine/scalar approach). The SIMD approach is about 2.5 times faster in my tests. You can gain a bit of speed by using 256-bit vectors (AVX) but not that much. All the functions can slightly overwrite to the output buffer and overread in the input buffer.

For my benchmark, I use short inputs ("F1S6QOJADHQMKS3GCLIN4RB9F1Q6UT37") and there is no inlining, which means that each call has to pay for register initialization and all that, which explains the lack of power of the wide SIMD approach.

base32hex_avx                  :   7.55 GB/s  235.9 Ma/s   4.24 ns/d   3.23 GHz  13.70 c/d  61.21 i/d    0.4 c/b   1.91 i/b   4.47 i/c 
base32hex_simd                 :   7.02 GB/s  219.3 Ma/s   4.56 ns/d   3.23 GHz  14.72 c/d  70.21 i/d    0.5 c/b   2.19 i/b   4.77 i/c 
base32hex_fast                 :   2.53 GB/s   79.0 Ma/s  12.65 ns/d   3.21 GHz  40.57 c/d  194.21 i/d    1.3 c/b   6.07 i/b   4.79 i/c 
base32hex_simple               :   2.35 GB/s   73.4 Ma/s  13.62 ns/d   3.21 GHz  43.69 c/d  231.21 i/d    1.4 c/b   7.23 i/b   5.29 i/c 
base32hex_swar                 :   1.98 GB/s   61.7 Ma/s  16.20 ns/d   3.20 GHz  51.85 c/d  230.21 i/d    1.6 c/b   7.19 i/b   4.44 i/c 

As before, I have made my code available...
https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2023/07/19

They are all maxing out the "instructions per cycle". So it is an instance where lowering the number of instructions is critical. Sadly, this might be a tad difficult for short inputs.

If you know that the size of the input (@aqrit raised this point), you can gain a bit of speed. I suspect it is not that much...

Good stuff @lemire. Good to have this documented, don't know if this is used in many other places, but people can at least find it on the internet now.

Can you include this version too? It doesn't make sense from a performance perspective, but it nicely shows where we started from(?)

We also have to check for padding(?) We can use the zero_mask trick but with equals signs instead. Modulo with 8, pick the right mask and compare?

We also have to check for padding(?)

I followed @aqrit's idea and we don't really validate the padding. Basically, you can stop the stream and put any garbage (not just =) and it will work. (It should work.)

But sure, we can try to check that we have =, as you describe.

commented

Base32: the "Base 32 Encoding with Extended Hex Alphabet" as
specified in [RFC4648]. Note that trailing padding characters
("=") are not used in the NSEC3 specification.

The Next Hashed Owner Name field is represented as an unpadded
sequence of case-insensitive base32 digits, without whitespace.

Is support for padding needed?

base32hex_fast

IMO, it is not worth pulling that much data into the cache.
However, the data in the table could be bswapped to eliminate the bswap in the code.
Bad_char entries could be a single set bit in the hi_byte, then only one branch for bad chars would be required..
though we'd still need to zero the bytes after the first bad char... because some bytes are made from 3 input chars.

@aqrit

IMO, it is not worth pulling that much data into the cache.

I agree. This is research.

I started implementing the padding check, and almost got it done, but after reading @aqrit 's comment, I am pulling out.

Because the functions return the number of bytes read, it is always easy to check the padding if you want to. All you need is a single loop when the number of bytes read is not a multiple of eight.

@aqrit I have implemented your optimizations but it hardly makes a difference performance-wise. :-)

However the code is prettier.

Ok. Here is what I get...

base32hex_avx                  :   8.02 GB/s  250.6 Ma/s   3.99 ns/d   3.24 GHz  12.91 c/d  61.21 i/d    0.4 c/b   1.91 i/b   4.74 i/c 
base32hex_simd                 :   6.95 GB/s  217.2 Ma/s   4.60 ns/d   3.23 GHz  14.86 c/d  70.21 i/d    0.5 c/b   2.19 i/b   4.72 i/c 
base32hex_fast                 :   2.54 GB/s   79.4 Ma/s  12.60 ns/d   3.21 GHz  40.39 c/d  119.21 i/d    1.3 c/b   3.73 i/b   2.95 i/c 
base32hex_simple               :   2.34 GB/s   73.0 Ma/s  13.70 ns/d   3.20 GHz  43.90 c/d  231.21 i/d    1.4 c/b   7.23 i/b   5.27 i/c 
base32hex_swar                 :   1.99 GB/s   62.2 Ma/s  16.08 ns/d   3.20 GHz  51.51 c/d  230.21 i/d    1.6 c/b   7.19 i/b   4.47 i/c 
b32_pton                       :   0.21 GB/s    6.5 Ma/s  153.07 ns/d   3.19 GHz  488.84 c/d  1429.90 i/d   15.3 c/b  44.68 i/b   2.93 i/c 

I am moving the URL to Code-used-on-Daniel-Lemire-s-blog/2023/07/20 from Code-used-on-Daniel-Lemire-s-blog/2023/07/20 and pushing a blog post.

I started implementing the padding check, and almost got it done, but after reading @aqrit 's comment, I am pulling out.

Because the functions return the number of bytes read, it is always easy to check the padding if you want to. All you need is a single loop when the number of bytes read is not a multiple of eight.

Apparently, we don't even want to support padding 😅. RFC5155 states:

The Next Hashed Owner Name field is represented as an unpadded sequence of case-insensitive base32 digits, without whitespace.

And indeed, the .com zones I have laying around contain no padding. So in the PR, checking if the field is followed by a character not in the contiguous set is sufficient.

That's some great numbers @lemire! Thank you both!

The current code would work even if there is padding. It would simply be lenient about it (e.g., it would accept any character as padding, not just '=').

However, I should note that adding such a check (that the padding is done with '=') would be cheap... I considered adding it as an option (check_padding=true/false), but why make the function more complicated than it needs to be?

The code in NSD doesn't account for anything but valid characters and that code has been in use for a long time. I think it's better not to allow for padding and strictly follow the RFC. If there's a valid use case we can always add it afterwards(?)

@k0ekk0ek I think that's what the code we have written does. What I mean by lenient is that as soon as invalid characters are found, you consider that you have terminated the base32 sequence.

Of course, you could have something like....

F1S6QOJADHQMKS3GCLIN4RB9F1Q6UT37+8***fds

What is would do is that it would stop at the +.

Sounds good to me 👍

@lemire, are you working on a PR? No pressure, just want to know if you want to do the honors or if you want me to integrate the changes 🙂

I can prepare a PR, yes.

PR available.