NLnetLabs / simdzone

Fast and standards compliant DNS zone parser

Vectorize base16 (hex) decoding

k0ekk0ek opened this issue · comments

base16 encoding is used quite a lot, e.g. in DS records and to represent RDATA for unknown record types (as outlined in RFC 3597). Geoff Langdale and Wojciech Muła have researched how to decode it more efficiently; a couple of algorithms are outlined in this article. Other research may be available too.

commented

Base16 (hex) can be decoded similarly to base32hex, except that first we'd do v = _mm_add_epi8(v, _mm_set1_epi8(-1)) to make the spans line up. We could also get rid of both multiply instructions.
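
As a rough sketch of the shuffle-based idea (this is my own scaffolding, not the simdzone kernel: it keys a single PSHUFB lookup on the high nibble and omits the validation shuffle that the real decoders use to stop at the first non-hex byte):

```c
#include <immintrin.h>
#include <stdint.h>

/* Decode 16 ASCII hex digits into 8 bytes, assuming the input is already
 * known to be valid hex. The value of a hex digit is its low nibble, plus
 * 9 when the high nibble marks a letter ('A'-'F' = 0x4x, 'a'-'f' = 0x6x);
 * one PSHUFB keyed on the high nibble supplies that +9. */
__attribute__((target("ssse3")))
static void hex16_decode(const char src[16], uint8_t dst[8]) {
    const __m128i v   = _mm_loadu_si128((const __m128i *)src);
    const __m128i lo  = _mm_and_si128(v, _mm_set1_epi8(0x0F));
    const __m128i hi  = _mm_and_si128(_mm_srli_epi16(v, 4), _mm_set1_epi8(0x0F));
    const __m128i adj = _mm_shuffle_epi8(
        _mm_setr_epi8(0, 0, 0, 0, 9, 0, 9, 0, 0, 0, 0, 0, 0, 0, 0, 0), hi);
    const __m128i nib = _mm_add_epi8(lo, adj);   /* per-byte value 0..15 */
    /* combine pairs: word k = nib[2k] * 16 + nib[2k+1] */
    const __m128i w = _mm_maddubs_epi16(nib, _mm_set1_epi16(0x0110));
    _mm_storel_epi64((__m128i *)dst, _mm_packus_epi16(w, w));
}
```

A production version would also need the range/validity check so decoding can stop at the first non-hex character.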

But how many shuffles should be used? Haswell CPUs have only one shuffle execution port. An AVX2 port of the base32hex-style decoder would require 4 shuffles per 32 bytes of input. The linked solution by Geoff Langdale uses 2 shuffles and has only 1 more instruction in total, but it would require several more 32-byte constants from memory.

Let us test it out!

For base16, we'd need to keep some state as some record types (and generic RDATA representation) allow for spaces. In terms of the zone parser, that means calling lex multiple times, so we'd have to stop at the first non-hexadecimal character.

keep some state as some record types

The standard seems to require that bytes are encoded as pairs and that pairs are held together, so that the input is always an even number of characters. Are you saying that they don't do that? That'd be very strange.

Using GCC11 on an Ice Lake server, I get the following results:

base16hex_simd                 :   2.78 GB/s  253.1 Ma/s   3.95 ns/d   3.20 GHz  12.63 c/d  44.49 i/d    1.2 c/b   4.06 i/b   3.52 i/c 
base16hex_simd_geoff           :   2.00 GB/s  182.6 Ma/s   5.48 ns/d   3.20 GHz  17.50 c/d  65.48 i/d    1.6 c/b   5.97 i/b   3.74 i/c 
base16hex_simple               :   0.85 GB/s   77.4 Ma/s  12.92 ns/d   3.19 GHz  41.27 c/d  85.65 i/d    3.8 c/b   7.81 i/b   2.08 i/c 
base16hex_simdzone_fallback    :   0.76 GB/s   69.6 Ma/s  14.38 ns/d   3.19 GHz  45.91 c/d  88.13 i/d    4.2 c/b   8.04 i/b   1.92 i/c 

My benchmark uses relatively short strings of random length. We can (and maybe should) change that to more closely match the actual data. It is important to note that for both base16hex_simd and base16hex_simd_geoff the bulk of the instructions are in the setup, so if we expect longer strings, it would be important to know about it. Also, if these functions are to be inlined, I should probably change the benchmark, as inlining will make the SIMD code shine. I'd say my benchmark makes SIMD look bad... but I prefer to be conservative.

In my table, base16hex_simd_geoff is Geoff's version from the article @k0ekk0ek linked above, whereas base16hex_simd is basically our base32 decoder simplified for base16. So we are significantly faster than Geoff's approach as far as I can tell.

It is possible we can improve our approach further. I did not quite understand @aqrit's description: I am still using a multiplication when combining the bytes. I can do it without a multiplication, but the approach I imagine requires several extra instructions (no big deal, but slightly slower...). Maybe there is a clever way that I don't see.

The version in simdzone currently is base16hex_simdzone_fallback (or something close to it)...

Here is the assembly for our version (note that there is a hot loop in the middle):

base16hex_simd(unsigned char*, unsigned char const*):
        movdqu  xmm1, XMMWORD PTR [rsi]
        pcmpeqd xmm5, xmm5
        mov     rax, rsi
        movdqa  xmm7, XMMWORD PTR .LC1[rip]
        paddb   xmm1, xmm5
        movdqa  xmm6, XMMWORD PTR .LC0[rip]
        movdqa  xmm8, XMMWORD PTR .LC2[rip]
        movdqa  xmm2, xmm1
        movdqa  xmm3, xmm7
        psrld   xmm2, 4
        movdqa  xmm0, xmm8
        pand    xmm2, xmm6
        pshufb  xmm3, xmm2
        pshufb  xmm0, xmm2
        paddb   xmm0, xmm1
        paddb   xmm1, xmm3
        pmovmskb        edx, xmm1
        test    edx, edx
        jne     .L4
        movdqa  xmm4, XMMWORD PTR .LC3[rip]
        movdqa  xmm9, XMMWORD PTR .LC4[rip]
.L2:
        add     rax, 16
        pmaddubsw       xmm0, xmm4
        movdqa  xmm3, xmm7
        pshufb  xmm0, xmm9
        movups  XMMWORD PTR [rdi], xmm0
        movdqu  xmm1, XMMWORD PTR [rax]
        movdqa  xmm0, xmm8
        add     rdi, 8
        paddb   xmm1, xmm5
        movdqa  xmm2, xmm1
        psrld   xmm2, 4
        pand    xmm2, xmm6
        pshufb  xmm3, xmm2
        pshufb  xmm0, xmm2
        paddb   xmm0, xmm1
        paddb   xmm1, xmm3
        pmovmskb        edx, xmm1
        test    edx, edx
        je      .L2
.L4:
        bsf     rdx, rdx
        test    edx, edx
        je      .L3
        movsx   rdx, edx
        mov     ecx, 16
        sub     rcx, rdx
        add     rax, rdx
        movdqu  xmm1, XMMWORD PTR base16hex_simd(unsigned char*, unsigned char const*)::zero_masks[rcx]
        pandn   xmm1, xmm0
        movdqa  xmm0, XMMWORD PTR .LC3[rip]
        pmaddubsw       xmm1, xmm0
        pshufb  xmm1, XMMWORD PTR .LC4[rip]
        movups  XMMWORD PTR [rdi], xmm1
.L3:
        sub     rax, rsi
        ret

Here is the assembly for Geoff's version (note the hot loop in the middle):

base16hex_simd_geoff(unsigned char*, unsigned char const*):
        movdqa  xmm3, XMMWORD PTR .LC5[rip]
        mov     rax, rsi
        movdqu  xmm1, XMMWORD PTR [rsi]
        movdqa  xmm4, XMMWORD PTR .LC6[rip]
        movdqa  xmm0, xmm3
        movdqa  xmm5, XMMWORD PTR .LC7[rip]
        paddb   xmm0, xmm1
        movdqa  xmm6, XMMWORD PTR .LC8[rip]
        psubusb xmm0, xmm4
        movdqa  xmm8, XMMWORD PTR .LC9[rip]
        pand    xmm1, xmm5
        movdqa  xmm7, XMMWORD PTR .LC10[rip]
        paddb   xmm1, xmm6
        movdqa  xmm9, XMMWORD PTR .LC11[rip]
        paddusb xmm1, xmm8
        paddb   xmm0, xmm7
        pminub  xmm0, xmm1
        movdqa  xmm1, xmm0
        paddusb xmm0, xmm9
        pmovmskb        edx, xmm0
        test    edx, edx
        jne     .L10
        movdqa  xmm2, XMMWORD PTR .LC3[rip]
        movdqa  xmm10, XMMWORD PTR .LC4[rip]
.L8:
        add     rax, 16
        pmaddubsw       xmm1, xmm2
        movdqa  xmm0, xmm3
        pshufb  xmm1, xmm10
        movups  XMMWORD PTR [rdi], xmm1
        movdqu  xmm1, XMMWORD PTR [rax]
        add     rdi, 8
        paddb   xmm0, xmm1
        pand    xmm1, xmm5
        paddb   xmm1, xmm6
        psubusb xmm0, xmm4
        paddusb xmm1, xmm8
        paddb   xmm0, xmm7
        pminub  xmm0, xmm1
        movdqa  xmm1, xmm0
        paddusb xmm0, xmm9
        pmovmskb        edx, xmm0
        test    edx, edx
        je      .L8
.L10:
        bsf     rdx, rdx
        test    edx, edx
        je      .L9
        movsx   rdx, edx
        mov     ecx, 16
        sub     rcx, rdx
        add     rax, rdx
        movdqu  xmm0, XMMWORD PTR base16hex_simd_geoff(unsigned char*, unsigned char const*)::zero_masks[rcx]
        pandn   xmm0, xmm1
        movdqa  xmm1, XMMWORD PTR .LC3[rip]
        pmaddubsw       xmm0, xmm1
        pshufb  xmm0, XMMWORD PTR .LC4[rip]
        movups  XMMWORD PTR [rdi], xmm0
.L9:
        sub     rax, rsi
        ret

The source code is available at

https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2023/07/26

The standard seems to require that bytes are encoded as pairs and that pairs are held together, so that the input is always an even number of characters. Are you saying that they don't do that? That'd be very strange.

Quite a few RR presentations don't require that pairs are held together. For instance DS is described in RFC 4034 as:

   The Digest MUST be represented as a sequence of case-insensitive
   hexadecimal digits.  Whitespace is allowed within the hexadecimal
   text.

If you know it is white space, then you can trim it out as a first pass: super fast if AVX-512 is available, slower but still fast otherwise.

I can easily include that in the benchmark.

Again… what matters is whether I am benchmarking the right problem. We need to make sure that I do.

E.g., when I load the data, I can add a branch that checks for white space and prune it out dynamically as needed.

Looks good @lemire!

Note that for simdzone, if there's white space we need to stop, as for contiguous strings that's a delimiter and the next set of characters may not be complete.

As for @aqrit's comment, I think he means we need to subtract 1, as A-F starts at 0x41, not 0x40, so we can use his range check (please correct me if I'm wrong). And, as for the multiplication, for base16 we can get away with a simple shift(?)

And, as for the multiplication, for base16 we can get away with a simple shift(?)

Last night, I managed to get...

  const __m128i t1 = _mm_srli_epi32(v, 4);
  const __m128i t3 = _mm_or_si128(t0, t1); // v and t0 are defined earlier in the function
  const __m128i t4 = _mm_and_si128(
      t3, _mm_set1_epi16(0x00ff)); // keep just lower bytes in words

as opposed to...

  const __m128i t3 = _mm_maddubs_epi16(v, _mm_set1_epi16(0x0110));

I don't think it is a win.

Note that for simdzone, if there's white space we need to stop, as for contiguous strings that's a delimiter and the next set of characters may not be complete.

Can you elaborate? E.g., is it the case that 44fg3 c is equivalent to 44fg3c? How long are the base16 sequences? How many spaces are there?

There are many ways to deal efficiently with the spaces, but the details appear to matter a lot. The length of the contiguous base16 sequences matters.

  1. If you have long base16 sequences (e.g., >= 32 bytes) with few spaces, then it may make sense to not worry too much about spaces, and just feed forward a leftover character, optionally. I'd still do it as one function so that the function call overhead does not dominate so much (e.g., don't constantly repopulate the registers).
  2. If you have short sequences with lots of spaces, then you may want to load the data, prune the spaces and process that... the routine is much the same, but you are actually eating up the spaces directly.
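
Option 2 can be expressed first as a scalar reference (function names here are my own, not simdzone's): gather hex digits while skipping spaces, stop at the first character that is neither, and pair the digits into bytes.

```c
#include <stddef.h>

/* Scalar sketch of the "prune the spaces" strategy (hypothetical names):
 * decode hex digits while skipping spaces, stopping at the first character
 * that is neither. Returns the number of output bytes, or -1 if the field
 * ends with an unpaired digit. */
static int hex_value(unsigned char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

static long base16_skipspace(const char *src, size_t len, unsigned char *dst) {
    long n = 0;
    int hi = -1;                          /* pending high nibble, if any */
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)src[i];
        if (c == ' ')
            continue;                     /* prune spaces */
        int v = hex_value(c);
        if (v < 0)
            break;                        /* delimiter: end of field */
        if (hi < 0)
            hi = v;
        else {
            dst[n++] = (unsigned char)(hi << 4 | v);
            hi = -1;
        }
    }
    return hi >= 0 ? -1 : n;              /* odd digit count is an error */
}
```

A vectorized version does the same work, but classifies and compresses whole registers at a time.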

Note that for simdzone, if there's white space we need to stop, as for contiguous strings that's a delimiter and the next set of characters may not be complete.

Examples would be great.

I added a version (base16hex_simd_skipspace) that can turn "666 F 6 F6 261 72" into "foobar". That is, we skip the spaces. There is a performance penalty but it is acceptable. We still beat 2 GB/s on my benchmark.

$ ./benchmark
base16hex_simd                 :   2.83 GB/s  258.4 Ma/s   3.87 ns/d   3.20 GHz  12.37 c/d  45.49 i/d    1.1 c/b   4.15 i/b   3.68 i/c 
base16hex_simd_geoff           :   1.55 GB/s  141.2 Ma/s   7.08 ns/d   3.19 GHz  22.62 c/d  58.48 i/d    2.1 c/b   5.33 i/b   2.59 i/c 
base16hex_simd_skipspace       :   2.06 GB/s  187.7 Ma/s   5.33 ns/d   3.20 GHz  17.02 c/d  61.80 i/d    1.6 c/b   5.64 i/b   3.63 i/c 
base16hex_simple               :   0.81 GB/s   73.5 Ma/s  13.61 ns/d   3.19 GHz  43.46 c/d  85.65 i/d    4.0 c/b   7.81 i/b   1.97 i/c 
base16hex_simdzone_fallback    :   0.78 GB/s   71.2 Ma/s  14.04 ns/d   3.19 GHz  44.83 c/d  98.18 i/d    4.1 c/b   8.95 i/b   2.19 i/c 
commented

multiplication vs shift

I was incorrectly assuming a bswap would be required.
So it would have been shift -> or -> shuffle vs. pmaddubsw -> shuffle.
However, using pack is good; it would save an extra shuffle if unrolled 2x.
For SSSE3, when not unrolled, _mm_storel_epi64 could be used to write 8 bytes.

we are significantly faster than Geoff's approach as far as I can tell

We might not hit a shuffle bottleneck here, but the Base64 decoder did get slower when I tried using an extra shuffle,
so it was just something I was thinking about.

Ice Lake seems to have two shuffle ports.

Also note: __m256i would require an extra shuffle (vpermq) or an extract of the high 128 bits.

v = _mm_add_epi8(v, _mm_set1_epi8(-1))

v = _mm_add_epi8(v, _mm_set1_epi8(0x0F)) would reuse the nibble mask and save 1 "setup" instruction.

Note that for simdzone, if there's white space we need to stop, as for contiguous strings that's a delimiter and the next set of characters may not be complete.

Can you elaborate? E.g., is it the case that 44fg3 c is equivalent to 44fg3c? How long are the base16 sequences? How many spaces are there?

There are many ways to deal efficiently with the spaces, but the details appear to matter a lot. The length of the contiguous base16 sequences matters.

  1. If you have long base16 sequences (e.g., >= 32 bytes) with few spaces, then it may make sense to not worry too much about spaces, and just feed forward a leftover character, optionally. I'd still do it as one function so that the function call overhead does not dominate so much (e.g., don't constantly repopulate the registers).
  2. If you have short sequences with lots of spaces, then you may want to load the data, prune the spaces and process that... the routine is much the same, but you are actually eating up the spaces directly.

Yes 44fg3 c is equivalent to 44fg3c. In the RRs where whitespace may appear in base16 (or base64) the field is the last one in the RR.

In current usage, base16 with whitespace is used in DS, CDS, ZONEMD and SSHFP to present digests that cap out at 64 bytes (SHA-512). They probably represent the general case, though the SMIMEA and TLSA RRs also use it for either a digest or a certificate (I'm not that familiar with those RRs, so that could be wrong).

The spaces question is trickier because of parentheses. From RFC 1035:

( )             Parentheses are used to group data that crosses a line
                boundary.  In effect, line terminations are not
                recognized within parentheses.

So space may encompass traditional whitespace, newlines, comments, or nested parentheses.

I don't have data on how the presentation format is used in the wild, but anecdotally I don't believe I've ever seen base16 broken by whitespace on a nibble boundary outside of test cases, and most tools seem to present base16 without whitespace.

For base16, we always require contiguous strings. Some RRTYPEs allow for spaces, some don't: e.g. in NSEC3 and NSEC3PARAM (for the salt) spaces are not allowed, while for DS they are. Most types that use base16 are covered in types.h; e.g. for the DS RRTYPE the digest field descriptor is this. Generally speaking, if the field is what I call a BLOB (no length octet, so its length is determined by the rdlength field in the packet and it is therefore always last), spaces are allowed. They are used mostly for readability.

It's hard to say how DNS operators present the fields. Since spaces are allowed, it's valid to do: a b c 1 2 3 e f g, but of course that never happens in practice.

As for examples:

  • SSHFP example from RFC
    host.example. SSHFP 2 1 123456789abcdef67890123456789abcdef67890
  • DS example from RFC
    example. DS 12345 3 1 123456789abcdef67890123456789abcdef67890
  • NSEC3 example from RFC (salt aabbccdd)
    0p9mhaveqvm6t7vbl5lop2u3t2rp3tom.example. NSEC3 1 1 12 aabbccdd (
                           2t7b4g4vsa5smi47k61mv5bv1a22bojr MX DNSKEY NS
                           SOA NSEC3PARAM RRSIG )
    
  • ZONEMD example from RFC
    example.com. 86400 IN ZONEMD 2018031500 1 1 (
        FEBE3D4CE2EC2FFA4BA99D46CD69D6D29711E55217057BEE
        7EB1A7B641A47BA7FED2DD5B97AE499FAFA4F22C6BD647DE )
    

There is another case where base16 can be used for all record types, and that's with generic encoding. RFC3597 introduced a generic format to present RRs unknown to the name server. Basically, you can present an A record in one of two ways:

host.example.com. A 192.0.2.1

or, in generic notation:

host.example.com. TYPE1 \# 4 c0000201

So, it's hard to say how long the sequence will be. Generic notation is rarely used (though it may be an interesting approach to parsing the zone 🤔). It's probably best to focus on DS RRs (ZONEMD is one record per zone). For DS, the .se zone is probably a good example of real-world data.
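
As a quick sanity check on the generic-notation example above (the scaffolding here is mine), the RDATA "c0000201" decodes to exactly the four octets of 192.0.2.1:

```c
#include <stdio.h>

/* Illustration only: decode 8 hex digits of generic RDATA into the 4
 * octets they encode; "c0000201" is the TYPE1 (A record) example above. */
static void rdata4_decode(const char *hex, unsigned char out[4]) {
    for (int i = 0; i < 4; i++)
        sscanf(hex + 2 * i, "%2hhx", &out[i]);
}
```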

There are many ways to deal efficiently with the spaces, but the details appear to matter a lot. The length of the contiguous base16 sequences matters.

  1. If you have long base16 sequences (e.g., >= 32 bytes) with few spaces, then it may make sense to not worry too much about spaces, and just feed forward a leftover character, optionally. I'd still do it as one function so that the function call overhead does not dominate so much (e.g., don't constantly repopulate the registers).
  2. If you have short sequences with lots of spaces, then you may want to load the data, prune the spaces and process that... the routine is much the same, but you are actually eating up the spaces directly.

I had not considered the second option, but it's certainly worth considering. Thinking about it some more, we could introduce a lex function that guarantees a complete base16 field. We may then opt to ignore [ \t\r\n\(\)] (or just any character that is not a hex digit) when parsing the base16 field. Not sure if it helps, but it'd allow us to not copy the data?

Blog post: Decoding base16 sequences quickly. Whether this can be used productively in simdzone is an open question. Typically, one would first test whether a fast path is possible and, if so, use a routine such as these; if not, fall back on something slower. The check for a fast path needs to be cheap, but also to cover a broad range of use cases. This will require domain expertise to set up.

commented

Note: the zero_masks table is probably not needed here, because output bytes are formed from only two input bytes. At worst, bad chars could be zeroed out using a blend?

Nice write-up @lemire! It's good to have this work nicely documented.

This will require domain expertise to set up.

Pun intended? 😅 But, yes. Zone files are weird for people not that familiar with DNS (that includes me 🙂). Actually, they're weird for people familiar with DNS too.

I think that for .se this would make a difference. The zones I have laying around (one oldish, one newish) have the following number of DS records.

1433128 / 8924060 (DS / all) (2019-09-05)
1023294 / 8509310 (DS / all) (2023-02-02)

The older .com has ~2.9 million (341535563 RRs total). .se is a zone where a lot of domains are DNSSEC signed (there are others, .nl being one of them). So for .se we'll probably see much better numbers, while for .com the difference will be noticeable but much less significant (it had around the same number of NSEC3 RRs (base32)).

In both cases the base16 data is one of the following (SHA-1, SHA-256 or SHA-384 encoded digests):

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXX

We can probably do a base16 decoder just for DS records and keep using the very naive one for all other occurrences of base16 data? We'd simply call parse_fast_base16 as opposed to parse_base16 from the parse_ds_rdata function?

@lemire, I can have a look next week unless you want to have a stab at it? It's fine with me either way.

You probably don't want to use generic code. I suggest that the code should be tailored to the data you do have, with possible fallbacks when the data has an unexpected form. E.g., if you expect that the input should be XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX or XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXX, then write tests for these inputs and write functions specific to these cases.
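
One way to read that advice for DS records (the shape constants and names below are my assumptions, not simdzone code): classify the digest field against the handful of layouts that occur in practice, and only dispatch a fixed-width fast kernel when one matches.

```c
#include <stddef.h>

/* Sketch of tailoring to the data: recognize the two digest layouts quoted
 * above -- 40 hex digits (SHA-1), or 56 hex digits, a space, then 8 more
 * (a split SHA-256) -- and fall back to a generic parser otherwise. */
enum ds_shape { SHAPE_OTHER, SHAPE_HEX40, SHAPE_HEX56_8 };

static int is_hex(unsigned char c) {
    return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'F') ||
           (c >= 'a' && c <= 'f');
}

static enum ds_shape ds_classify(const char *s, size_t n) {
    size_t i = 0;
    while (i < n && is_hex((unsigned char)s[i]))
        i++;
    if (i == 40 && n == 40)
        return SHAPE_HEX40;               /* e.g. a SHA-1 digest */
    if (i == 56 && n == 65 && s[56] == ' ') {
        size_t j = 57;                    /* 56 hex, one space, 8 hex */
        while (j < n && is_hex((unsigned char)s[j]))
            j++;
        if (j == n)
            return SHAPE_HEX56_8;
    }
    return SHAPE_OTHER;
}
```

The classifier itself is cheap, and the fallback keeps unexpected inputs correct.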

I'll have a go at it later this week (probably). I'll test some different scenarios and see which one works. I kind of like the idea of having a separate lex function for these sequences (base16 + base64) (both are used quite a lot, DS records and RRSIG respectively), but I'll have to make some modifications so it's likely not going to be trivial.

RFC3597 section 5 (Handling of Unknown DNS Resource Record (RR) Types) states the following about hex-encoded sequences:

Zero or more words of hexadecimal data encoding the actual RDATA field, each containing an even number of hexadecimal digits.

I figured the same would apply to the DS digest RDATA section, but BIND accepts fields with an odd number of digits as long as the complete section is correct.

Base64 will have the same problem. Basically, you can wipe out all or most of the benefits of fast parsing if you optimize for arbitrary inputs instead of the inputs you actually have. If you go back to our original paper on base64, there is an appendix where we make the point that handling spaces at arbitrary locations can end up taking the bulk of the running time. It is just a fact that actual data often has low entropy: it comes in specific shapes most of the time, and that's what you want to optimize for.