BurntSushi / memchr

Optimized string search routines for Rust.

Runtime configuration of byte frequency table used to classify rare bytes

sentrip opened this issue · comments

memchr implements a generic SIMD-accelerated search that is ideal for implementing something like CheatEngine, where you scan the memory of an executable process to aid reverse engineering. This involves repeatedly scanning for possibly millions of small values (u16, i32, ...) in the memory of that process. The user might have information ahead of time about the frequency distribution of bytes in the memory being scanned, which can vary wildly between executables. The user might also be able to control the program to ensure certain rare bytes appear in the program's memory at certain times.

There is an issue that prevents memchr from performing optimally when scanning binary executables: the byte frequency table. The core algorithm is based on detecting rare bytes at specific positions in the haystack (the prefilter) and then testing these candidate matches to check whether the needle has been found. As mentioned in the incredibly detailed comments, the performance of this algorithm is highly dependent on the byte frequency table used to determine what counts as a rare byte. While the table included in memchr is optimal for the majority of cases, some specific data types have very different byte frequency distributions, which causes memchr to perform worse on those inputs than it otherwise might with a different byte frequency table.

To illustrate this point, consider the following byte frequencies (where ideal is the ideal frequency for an x86 binary):

byte   memchr   ideal
\x00   55       255
\xdd   255      0
\x8b   80       186
H      150      254

Now, consider scanning for the needle H\x00\xdd\x8b in an x86 binary. memchr would identify \x00 and \x8b as the rarest bytes, when they are in fact common bytes. Even if memchr considered \x00 to be a frequent byte via configuration, it would still choose H and \x8b as the rarest bytes, which are both much more common than \xdd, the only actually rare byte. This would result in a lot of unnecessary false positives, decreasing the throughput. This is a simple case, but it is easy to extend this idea to many other pathological input sequences that defeat the default frequency table, and might also reasonably appear in an executable or be scanned for by a user.
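To make the selection concrete, here is an illustrative sketch (not memchr's actual code) of how a rare-byte prefilter might choose needle positions from a rank table, where a lower rank means a byte is assumed to be rarer:

```rust
/// Illustrative sketch: return the positions of the two lowest-ranked
/// (i.e. assumed rarest) bytes in `needle`, given a rank table where a
/// lower rank means the byte is believed to occur less frequently.
fn rarest_two(needle: &[u8], rank: &[u8; 256]) -> (usize, usize) {
    assert!(needle.len() >= 2);
    // Indices of the two smallest ranks seen so far.
    let (mut a, mut b) = (0, 1);
    if rank[needle[b] as usize] < rank[needle[a] as usize] {
        std::mem::swap(&mut a, &mut b);
    }
    for i in 2..needle.len() {
        let r = rank[needle[i] as usize];
        if r < rank[needle[a] as usize] {
            b = a;
            a = i;
        } else if r < rank[needle[b] as usize] {
            b = i;
        }
    }
    (a, b)
}

fn main() {
    // Ranks loosely modeled on the "memchr" column of the table above:
    // \x00 and \x8b look rare to the default table even though they are
    // common in x86 binaries.
    let mut rank = [200u8; 256];
    rank[0x00] = 55;
    rank[0xdd] = 255;
    rank[0x8b] = 80;
    rank[b'H' as usize] = 150;
    // For the needle H\x00\xdd\x8b, positions 1 (\x00) and 3 (\x8b) win,
    // even though \xdd is the only byte that is actually rare.
    assert_eq!(rarest_two(b"H\x00\xdd\x8b", &rank), (1, 3));
}
```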

Now consider a haystack that contains HHH\x00\xdd\x8b. The user might know in advance that searching for HHH\x00 and searching for H\x00\xdd\x8b will both return a single unique match, the sub-slice that was mentioned earlier (the exact indices are not identical but that is not the point). The user might also know that \xdd is a very rare byte in their dataset. The user should be able to choose scanning for \xdd instead of a more common byte to speed up their searches. I cannot imagine how to support something like this without providing the user a mechanism for customizing the byte frequency table.

The proposed solution is to allow the user to specify the byte frequency table at runtime by modifying the memchr::memmem::rarebytes::rank function. Currently, this function reads from the global byte frequency table.

My first idea was to create an enum that can be provided to a FinderBuilder and then forwarded to RareNeedleBytes to choose the table:

enum ByteFrequencies<'a> {
    Default,
    Custom(&'a [u8; 256]),   
}

This enum can be stored in the NeedleInfo struct and used at runtime to determine which byte frequency table to use. However, this introduces the lifetime 'a, which may or may not be the same as the needle ('n) and haystack ('h) lifetimes that are stored in related structs. Considering lifetime 'a to be separate and different requires the public API of Finder to be changed to add this lifetime.

  • ByteFrequencies -> NeedleInfo -> Searcher -> Finder

I believe the extra lifetime might make life more difficult for the compiler, which may explain the small but noticeable (around 10%) impact I observed on the performance of constructing a Finder with the default frequency table on my local machine.

Also, by introducing a new member on the NeedleInfo struct, the size/alignment properties of Finder, Searcher and NeedleInfo changed, which might also be the reason for the performance impact I observed. (If this sounds crazy to anyone, I suggest you take a look at the wonderful performance talk by the legend Emery Berger titled 'Performance Matters' for more details: https://www.youtube.com/watch?v=r-TLSBdHe1A.)

An idea to remove the generic lifetime from ByteFrequencies:

enum ByteFrequencies {
    Default,
    Custom(&'static [u8; 256]),   
}

However, I believe this static API is logically inconsistent with the FinderBuilder API. You can construct millions of unique Finders at runtime and then discard them later, but the same cannot be said for static arrays.

Also, the user might want to perform analysis of their specific corpus at runtime to generate a specialized byte frequency table (like 'pre-training'). This is a very interesting use case in the context of the analysis of binary executables, as there is a lot of information that can only be obtained at runtime and can be useful in optimizing many kinds of searches. Forcing the user to use a static byte frequency table would necessarily prevent this use-case.
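As a sketch of that 'pre-training' idea, a rank table can be derived from a sample corpus at runtime by counting bytes and assigning ranks in ascending order of frequency (lower rank = rarer, matching memchr's convention). This is an illustration of the use case, not memchr code:

```rust
/// Build a rank table from a sample corpus: the least frequent bytes in
/// the corpus receive the smallest ranks.
fn build_rank_table(corpus: &[u8]) -> [u8; 256] {
    let mut counts = [0usize; 256];
    for &b in corpus {
        counts[b as usize] += 1;
    }
    // Sort the 256 byte values by ascending count; a byte's position in
    // that order becomes its rank, which always fits in a u8 (0..=255).
    let mut order: Vec<u8> = (0..=255u8).collect();
    order.sort_by_key(|&b| counts[b as usize]);
    let mut ranks = [0u8; 256];
    for (rank, &b) in order.iter().enumerate() {
        ranks[b as usize] = rank as u8;
    }
    ranks
}

fn main() {
    let ranks = build_rank_table(b"aaaabbbc");
    // The most frequent byte gets the highest rank, and so on down;
    // bytes that never occur get the lowest (rarest) ranks.
    assert_eq!(ranks[b'a' as usize], 255);
    assert_eq!(ranks[b'b' as usize], 254);
    assert_eq!(ranks[b'c' as usize], 253);
}
```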

Another idea to remove the generic lifetime and also allow runtime generation of the byte frequency table:

enum ByteFrequencies {
    Default,
    Custom(Box<[u8; 256]>),
}

However, an issue with this approach is that the ByteFrequencies enum has a size of 16 bytes which is mostly wasted. Another issue is that it seems that conceptually we should be passing around some kind of reference to a byte table that can be reused, instead of copying the table for each construction, but that ultimately depends on benchmarks. Also, now the standard library and memory allocation are required for an operation that is unrelated to both of those things (Rc, Arc and others have similar issues).

I also tried storing the byte table inline, but this had disastrous results on performance. This is probably because this extra storage pushed important members on related structs into new cache lines, which affected subsequent operations on these members.

enum ByteFrequencies {
    Default,
    Custom([u8; 256]),
}

One thing I have not tried yet but might be interesting is re-organizing the members and memory layout of any struct that stores a ByteFrequencies object. This might, for example, allow using an inline byte frequency table, but would likely result in breaking changes to the layout of public structs in memchr. Even just introducing the ByteFrequencies object already changes the memory layout of certain structs, and I am not sure whether that is undesirable or not.

All of this culminated in the pull request I submitted, but I realize now it is better to just lay it all out here and figure out the best path forward together. I appreciate any feedback you may have on these suggestions.

P.S. I think memchr is an incredible library and the code quality and detail of documentation definitely helped me greatly in understanding the internals and even being able to suggest this in the first place, so kudos.

Thank you, this is a lovely issue! I definitely agree this is a problem we should solve, and I think it's solvable.

One thing that sticks out to me here is that the representation of "byte frequency table" as a [u8; 256] is perhaps itself just an implementation detail that doesn't necessarily need to be exposed. Namely, even internally, the table isn't really used directly. Instead, what's used is a ranking function:

/// Return the heuristical frequency rank of the given byte. A lower rank
/// means the byte is believed to occur less frequently.
fn rank(b: u8) -> usize {
    crate::memmem::byte_frequencies::BYTE_FREQUENCIES[b as usize] as usize
}

Indeed, I think that all we actually need from the caller is something that takes any byte value as input and returns a rank (in the range 0..=255) as output. Ideally, the contract of the function would be as simple as this: a smaller rank represents bytes that are heuristically assumed to be less frequent. (The reality is that this may not quite be the full story. While the memchr crate doesn't currently do this, it is plausible that there could be checks like "if the rank of the rarest byte is above a certain value, then don't use a prefilter whose perf is tied to the frequency distribution of bytes at all." This means that it may not just be the relative rank that matters but also the absolute rank. Which is... kind of unfortunate because it makes the contract quite a bit more complicated.)

What this means is that we have a fair bit of flexibility in what we could accept. There are possibly more options, but I'll just list the ones I can think of off the top of my head quickly:

  1. Accept a fn(u8) -> u8. This is kind of analogous to accepting a &'static [u8; 256] in that it doesn't permit much flexibility. But because it's a function pointer, the caller can technically insert some shenanigans to do various things, including using a frequency table that has been computed at runtime. (Such a thing would look quite similar to your PR implementation I imagine.)
  2. Accept an Arc<dyn (FnMut(u8) -> u8) + Send + Sync + UnwindSafe + RefUnwindSafe + 'static>. This requires alloc unfortunately, but otherwise works like (1).
  3. This is almost isomorphic to (2), but we could define a new trait HeuristicFrequencyRank that can be implemented by any type (including zero-sized types). But it has the same problem as (2). To avoid the type parameter infecting the type signature (which I would want to do even if we were designing this API from scratch), we'd have to erase it. And the only way to erase it without introducing a lifetime parameter is with an allocation. So this approach would also require alloc.

It is quite tempting to go with (1) because it's a universal API. If we didn't care about the no-std no-alloc use case, then I think (2) would be my preferred route.

One possible riff on (3) is to add a default type parameter to Finder. It would default to a zero sized type representing the default byte frequency table, but could be overridden to be anything else. This wouldn't require alloc because we would be accepting the infection of the type parameter instead, but kind of hiding it via the default type parameter trick. I believe it is the case that adding a default type parameter to an existing type is a non-breaking change.
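A minimal sketch of option (3) with the default type parameter trick. The trait name follows the discussion, but everything else here (type names, the placeholder rank body) is hypothetical:

```rust
trait HeuristicFrequencyRank {
    fn rank(&self, byte: u8) -> u8;
}

/// Zero-sized type standing in for the built-in frequency table.
struct DefaultFrequencyRank;

impl HeuristicFrequencyRank for DefaultFrequencyRank {
    fn rank(&self, byte: u8) -> u8 {
        // Placeholder: the real implementation would index the built-in
        // BYTE_FREQUENCIES table here.
        byte
    }
}

// Adding `R = DefaultFrequencyRank` to an existing `Finder<'n>` is meant
// to be non-breaking: existing code naming `Finder<'n>` still compiles,
// and the zero-sized default adds no space to the struct.
struct Finder<'n, R = DefaultFrequencyRank> {
    needle: &'n [u8],
    ranker: R,
}

fn main() {
    let f = Finder { needle: b"abc", ranker: DefaultFrequencyRank };
    assert_eq!(f.ranker.rank(7), 7);
    assert_eq!(f.needle.len(), 3);
    // The default ranker is truly zero-sized, so the struct layout is
    // the same as if it only held the needle reference.
    assert_eq!(std::mem::size_of::<DefaultFrequencyRank>(), 0);
    assert_eq!(std::mem::size_of::<Finder>(), std::mem::size_of::<&[u8]>());
}
```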

There are also more complex ideas, such as combining the options above. For example, we might expose (1) in all configurations but (2) in just configurations that have alloc. But I'm not a huge fan of that because of the complexity, and I do wonder whether it might have a perf impact because of the additional space required to store both things.

I think I like (3) with the default type parameter trick the best? Because it should literally compile down to absolutely zero overhead for any table that is truly static, since it would just be implemented via a zero-sized type. As long as the default type parameter is a non-breaking change (and again I believe it is), it is a lamentable API addition but one that I think I can live with because folks can legitimately completely ignore it and not be worse off. The addition of an entire trait also kind of sucks too, and I honestly hate adding traits to crates unless they're really well motivated. But because of the no-std no-alloc constraint here, our choices are quite limited.

(Note: There is currently only a std mode and a no-std no-alloc mode. There ought to be a no-std-but-with-alloc mode too, however it hasn't been added yet because most things in this crate don't benefit from it. The only exception to this currently is the Finder::into_owned API, which is currently only available when std is enabled, but should in theory only require alloc. Otherwise, generally speaking, the only impact std has is that it enables runtime CPU feature detection to unlock SIMD optimizations that utilize AVX2.)

Thanks for the quick and very detailed response!

I agree that HeuristicFrequencyRank (3, with the default type parameter) is the most flexible option, and the least likely to regress performance, given that in the default case the assembly should be very similar or identical to the original implementation, as the zero-sized type should be optimized away (like you mentioned).

the representation of "byte frequency table" as a [u8; 256] is perhaps itself just an implementation detail that doesn't necessarily need to be exposed

I agree 100%. For some reason I completely ignored 'generic' implementations (like traits/functions) probably due to how small the input space is (0..=255) and the many restrictions that come along with no-std no-alloc considerations.

For example, we might expose (1) in all configurations but (2) in just configurations that have alloc. But I'm not a huge fan of that because of the complexity

I am also against that kind of complexity, especially when it results in a lot more cases that need to be tested which I believe is the case here. Also the added complexity is pretty much an implementation detail due to no-std no-alloc rather than being necessary to represent the fundamental idea (mapping u8 -> u8).

I believe it is the case that adding a default type parameter to an existing type is a non-breaking change.

Ideally it wouldn't, but in practice it might affect the output, and I honestly do not know enough about how the Rust compiler works yet to make that kind of decision. However, since it seems worth investigating, I will look into it.

it is a lamentable API addition
I honestly hate adding traits to crates unless they're really well motivated

I also agree with these, and the idea of changing the public API for people that are not interested in this feature makes me uncomfortable.

My first goal will be to implement HeuristicFrequencyRank and the default type parameter with as little impact as possible on the assembly generated for code that uses the default frequency table. I'm going to benchmark this with a specific benchmark that is only for measuring the time to create a Finder and use the selected frequency table once (or as few times as possible).

If I can implement this without affecting the default code path, I will consider it a success. I am still going to investigate methods that do not affect the public API at all; however, if I am unsuccessful with that, I will upload the HeuristicFrequencyRank version as a PR.

It goes against my principles to change the public API (even if it still compiles) AND regress performance (even a small amount) for users that do not care about this feature. It seems to me that this would (rightly) encourage those users to use an older version of the library. Also, in Rust we like zero-cost abstractions. I think I would only feel comfortable introducing something like this if EITHER the public API changes (but not how it is used) OR the default code path is minimally (1%-2%) slower, but not both.

Some more thoughts I had not directly related to what is above, but related to what you said:

Accept a fn(u8) -> u8. This is kind of analogous to accepting a &'static [u8; 256] in that it doesn't permit much flexibility

This is interesting, but I think it might be more problematic than at first glance. If a user is going to do anything based on runtime info to generate the frequency table, then they have to include all of the data they need in some static struct that they use in the fn(u8) -> u8 that is passed along to memchr (similar to the original PR). This involves either slow or unsafe code that ideally the user should not write.

I also think it kind of defeats the purpose of a generic solution if the only way to comfortably use that generic solution is to do the exact same thing you could already do with the non generic solution. In this case, the fn(u8) -> u8 would pretty much just end up being a wrapper for a static array, unless the user does some ugly stuff themselves. In this case, why not just use a static array directly?

If memchr were to implement static storage and the unsafe code for just the frequency table, then the user can avoid having any statics/unsafe in their code when generating frequency tables at runtime. However, all of what you mentioned in the original PR made me realize that implementing this inside of memchr is not a good idea, so I don't think I'll be investigating this. I would like to mention however some thoughts I had on this for the sake of completeness (ignore the next paragraph if that does not interest you):

If memchr has a function to change the global byte frequency table that is thread safe, then the user can avoid all static/unsafe code. Furthermore, since the table is only queried during the construction of a Finder, that is the only time that the state of the global is relevant at all. If the purpose of a Finder is to be constructed ahead of time, and then re-used for multiple searches, then the user is probably creating finders in a synchronous loop ahead of time, and then using those finders immutably with multiple threads. If the user could call something like set_byte_frequencies before constructing each Finder, which they are likely doing in a small simple loop, then they could arbitrarily modify the frequency table at runtime safely without touching the original code at all. In the original PR I only considered setting a new static byte frequency distribution, not generating arbitrary distributions at runtime, but I think the latter could also be achieved by storing the entire [u8; 256] in a static variable alongside the pointer that stores the frequency table and copying the runtime frequency table if necessary. Regardless, I am not going down this path anymore but felt it might be helpful/interesting to someone reading this to write some of this down.
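For completeness, the abandoned global-table idea might have looked roughly like this (all names here are hypothetical, and this is a simplified sketch, not what was in the PR):

```rust
use std::sync::Mutex;

// A process-wide, thread-safe frequency table that the user sets before
// constructing each Finder. Initialized to a flat table.
static BYTE_FREQUENCIES: Mutex<[u8; 256]> = Mutex::new([128; 256]);

fn set_byte_frequencies(table: [u8; 256]) {
    *BYTE_FREQUENCIES.lock().unwrap() = table;
}

fn rank(b: u8) -> u8 {
    // The table is only queried while constructing a Finder, so the lock
    // is taken rarely and never on the search hot path.
    BYTE_FREQUENCIES.lock().unwrap()[b as usize]
}

fn main() {
    assert_eq!(rank(0xdd), 128); // default flat table
    let mut table = [128u8; 256];
    table[0xdd] = 0; // mark 0xdd as very rare for this corpus
    set_byte_frequencies(table);
    assert_eq!(rank(0xdd), 0);
}
```

Note this sketch requires std (or at least a no-std synchronization primitive), which is part of why the approach was abandoned.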

Your approach SGTM.

It goes against my principles to change the public API (even if it still compiles) AND regress performance (even a small amount) for users that do not care about this feature. It seems to me that this would (rightly) encourage those users to use an older version of the library. Also, in Rust we like zero-cost abstractions. I think I would only feel comfortable introducing something like this if EITHER the public API changes (but not how it is used) OR the default code path is minimally (1%-2%) slower, but not both.

Just to give you my feelings on the matter:

  • Search speed is the most important thing in the context of this crate. With that said, measuring search speed changes at the level of 1-2% is pretty difficult in my experience because of noisy benchmarks. So if you wound up with a 1-2% regression in search speed, I'd probably throw that in the "cost of doing business" bucket and move on. I doubt anyone is going to pin or fork this crate over 1%.
  • If you're referring to searcher construction, then 1-2% feels OK to me. You probably have even more wiggle room than that. Searcher construction is already fairly slow, and it isn't hard to see it in real benchmarks. (Sorry, that section in my blog post is a bit long, but this comes up around the midway point of that section.) However, unlike many other substring search implementations (such as the one in Rust's standard library or even the venerable memmem from libc), this crate provides a way to mitigate the cost of searcher construction: you can build a searcher once and execute searches with it many times. Now obviously, we still want searcher construction to be fast, so you don't have carte blanche here, but a small regression isn't the end of the world.

Basically, in my experience, setting the standard to "doesn't change codegen at all" actually winds up being quite difficult to live up to in practice. Compilers do all sorts of crazy things, and there are cliffs and thresholds everywhere such that even a slight perturbation somewhere has a domino effect that impacts codegen substantially. But that doesn't necessarily mean actual wall clock times are impacted substantially.

So in summary, you probably have a little more wiggle room than you think. And I would be surprised if this trait approach with a default type parameter resulted in a perf change big enough to negate the strategy entirely.

In this case, why not just use a static array directly?

To be clear, the reason is precisely because with a function pointer, the caller can do shenanigans that make it possible to swap out the frequency table for something else at runtime. Asking for a &'static [u8; 256] means that's not possible to do without either unsound APIs or leaking memory. Basically, the function pointer introduces indirection in a way that is compatible with the no-std no-alloc constraint.
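Those shenanigans might look something like the following sketch (the static and function names are hypothetical): a plain fn(u8) -> u8 that reads from a table computed at runtime and stashed in a static, without allocation on memchr's side.

```rust
use std::sync::OnceLock;

// Table computed at runtime and installed once, behind a static.
static TABLE: OnceLock<[u8; 256]> = OnceLock::new();

fn runtime_rank(b: u8) -> u8 {
    // Fall back to a flat table until one has been installed.
    TABLE.get().map_or(128, |t| t[b as usize])
}

fn main() {
    let mut t = [128u8; 256];
    t[0xdd] = 0; // runtime analysis decided 0xdd is rare
    TABLE.set(t).ok();

    // This is an ordinary function pointer, so it satisfies a
    // fn(u8) -> u8 parameter without lifetimes or allocation.
    let ranker: fn(u8) -> u8 = runtime_rank;
    assert_eq!(ranker(0xdd), 0);
    assert_eq!(ranker(0x00), 128);
}
```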

The way something like this would be justified is the idea that needing a different ranking function than the default is itself quite niche. And thus, forcing the caller to jump through hoops---but hoops that are possible and sound---to achieve their use case is generally "okay" with me. That includes needing to write unsafe, especially since this is probably on the "easier" end of the unsafe spectrum IMO. But in this particular case, if we can make the default type parameter approach work, then I think it's enough better to go that route than the more cumbersome fn(u8) -> u8 approach.

Thanks for the insight!

I agree with you about the regressions and was probably exaggerating a bit, but I would much rather have you tell me that than come in here with non-conservative ideas about regressing performance for all users :)

Basically, in my experience, setting the standard to "doesn't change codegen at all" actually winds up being quite difficult

Yeah, you are totally right, and I honestly don't usually stress too much as long as the benchmarks are doing what I want. However, because memchr is so popular and the whole point of the library is speed, I wanted to try everything before making modifications that regress performance.

I do think I've found a good solution. Here's a brief explanation of what I've done (in case you have some thoughts or are curious):

I realized HeuristicFrequencyRank does not need to be stored anywhere, it only exists for the duration of a call to Searcher::new. This allowed me to avoid touching the public API at all. Now there are 2 versions of Searcher::new, where one of them calls the other with a hardcoded value for HeuristicFrequencyRank.

There are only 2 additions to the public API, and no changes:

  • HeuristicFrequencyRank trait
  • FinderBuilder::build_heuristic generic method that accepts HeuristicFrequencyRank

There are a few tweaks to the internal code but they are also minimal.

Finally, the default code path has (at least on my machine) identical assembly and no performance regression (there is a benchmark to prove this too). Also, the cost of using HeuristicFrequencyRank when constructing a finder is a result of the performance difference of direct and indirect calls in assembly and pretty much nothing else.
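The shape described above can be sketched roughly as follows. The trait and method names come from the discussion, but the struct bodies and the rank logic are placeholder assumptions:

```rust
trait HeuristicFrequencyRank {
    fn rank(&self, byte: u8) -> u8;
}

// Zero-sized stand-in for the built-in frequency table.
struct DefaultRank;

impl HeuristicFrequencyRank for DefaultRank {
    fn rank(&self, byte: u8) -> u8 {
        byte // placeholder for the real BYTE_FREQUENCIES lookup
    }
}

struct Finder {
    rare1: u8, // rank of the rarest needle byte, per the chosen ranker
}

struct FinderBuilder;

impl FinderBuilder {
    // Existing entry point: signature unchanged, forwards to the generic
    // version with the zero-sized default ranker, so codegen for the
    // default path can stay identical.
    fn build_forward(&self, needle: &[u8]) -> Finder {
        self.build_heuristic(DefaultRank, needle)
    }

    // New generic entry point: the only place the type parameter appears.
    fn build_heuristic<R: HeuristicFrequencyRank>(&self, ranker: R, needle: &[u8]) -> Finder {
        let rare1 = needle.iter().map(|&b| ranker.rank(b)).min().unwrap_or(0);
        Finder { rare1 }
    }
}

fn main() {
    // Default path: rank == byte value in this sketch, so the minimum
    // needle byte (b'a' == 97) becomes rare1.
    assert_eq!(FinderBuilder.build_forward(b"cab").rare1, 97);

    // Custom ranker: inverted ranks make b'c' (255 - 99 == 156) rarest.
    struct Inverted;
    impl HeuristicFrequencyRank for Inverted {
        fn rank(&self, byte: u8) -> u8 {
            255 - byte
        }
    }
    assert_eq!(FinderBuilder.build_heuristic(Inverted, b"cab").rare1, 156);
}
```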

P.S.

I believe it is the case that adding a default type parameter to an existing type is a non-breaking change.

You were right about this, I was able to generate identical assembly to the original implementation with the default type parameter implementation.

Ah right, of course! The table is indeed only used during construction and I believe I generally expect that to remain true.

The second build_heuristic method is a little kludgy, but is maybe better than adding a type parameter to FinderBuilder. (Which I think would be required if there was a mutator method to "set" the table.)

The names might need some work too.

The second build_heuristic method is a little kludgy

Yeah you're right, I was having an issue with incorrect codegen (indirect calls) when storing the trait in the Searcher, so I kind of ignored the idea of storing the trait and moved on. I will change the implementation to use a mutator method and it shouldn't affect performance at all I think since the FinderBuilder can still call both the generic and regular versions of Searcher::new, which was the main problem.

So that only leaves the HeuristicFrequencyRank. My original inclination was to call it something like ByteFrequencies, but honestly a properly descriptive name seems quite long however you do it.

I would imagine the name of the mutator method should be the name of the trait in snake case or something similar, so I guess if that name is good then the method will also make sense.

Edit: Just realized I misread your comment, apologies.
The reason I thought the second build_heuristic method is better is to avoid changing the API, but I also agree that a mutator method would be more consistent with the FinderBuilder::prefilter method.
Either way I will try to think of some better names, especially for build_heuristic.

Sorry for the radio silence, I've had a lot of things to do this week and since I implemented the HeuristicFrequencyRank version in a local fork, I have just been using that.

I have documented and commented the code and am ready to upload a pull request, but there are a few more things I wanted to mention first, and I also want to list some names I have thought of so we can choose the best one. However, if you feel we should continue the discussion in a PR, I will upload what I have now and we can continue there.

Firstly, when I implemented the mutator method with the default type parameter for setting a custom HeuristicFrequencyRank I ran into what appears (to me at least) to be a fundamental problem.

The following line is in Finder::new, but is part of the public API and could be written by a user:

FinderBuilder::new().build_forward(needle)

When FinderBuilder is generic, the line above causes an error regardless of the default type parameter (the turbofish ::<T> is required). The following workaround illustrates the problem.

type T = FinderBuilder;
T::new().build_forward(needle)

I believe the reason is that when FinderBuilder appears in an expression, it is interpreted only as a generic type whose parameters must be inferred or spelled out (the default is not applied), whereas in a type position the default type parameter does apply, which is why the alias works. As far as I can tell this is a breaking change to the public API, and I cannot think of any way to resolve this.
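A minimal reproduction of the problem, using simplified hypothetical types in place of memchr's real ones:

```rust
#[derive(Default)]
struct DefaultH;

struct FinderBuilder<H = DefaultH> {
    ranker: H,
}

impl<H: Default> FinderBuilder<H> {
    fn new() -> FinderBuilder<H> {
        FinderBuilder { ranker: H::default() }
    }

    fn build_forward(self, needle: &[u8]) -> usize {
        needle.len() // stand-in for the real build
    }
}

fn main() {
    // Does not compile: in an expression, the default for `H` is not
    // applied, so `H` is ambiguous and rustc demands a turbofish:
    //
    //     let f = FinderBuilder::new().build_forward(b"abc");
    //
    // Workaround: name the type in a type position first, where the
    // default type parameter does apply.
    type T = FinderBuilder;
    let n = T::new().build_forward(b"abc");
    assert_eq!(n, 3);
}
```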

Also there were a few other warts that made me feel worse about the default type parameter implementation (could just be me though). Off the top of my head:

  • Having to implement a custom Default for FinderBuilder<H> to avoid HeuristicFrequencyRank: Default
  • Having to change the signature of all FinderBuilder methods to move-only (i.e. fn x(self) { ... }) to avoid HeuristicFrequencyRank: Copy
  • The mutator method returns a different type to the current type (FinderBuilder<DefaultH> -> FinderBuilder<CustomH>)
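The warts above can be sketched together like this (again with hypothetical simplified types; the real bounds and bodies would differ):

```rust
#[derive(Default)]
struct DefaultH;

struct FinderBuilder<H = DefaultH> {
    ranker: H,
}

// Wart 1: deriving Default on FinderBuilder<H> would demand H: Default
// for every ranker, so Default is implemented manually, only for the
// default parameter.
impl Default for FinderBuilder<DefaultH> {
    fn default() -> Self {
        FinderBuilder { ranker: DefaultH }
    }
}

impl<H> FinderBuilder<H> {
    // Wart 2: the method takes `self` by move (not `&mut self`) to avoid
    // requiring H: Copy/Clone.
    // Wart 3: the mutator returns a *differently parameterized* type,
    // FinderBuilder<H2> rather than Self.
    fn heuristic<H2>(self, ranker: H2) -> FinderBuilder<H2> {
        FinderBuilder { ranker }
    }
}

fn main() {
    // Default exists only for the default type parameter...
    let b: FinderBuilder = Default::default();
    // ...and setting a ranker hands back a different builder type.
    let b2 = b.heuristic(42u8);
    assert_eq!(std::mem::size_of_val(&b2.ranker), 1);
}
```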

All in all I prefer the build_heuristic route due to the minimalism, but if you have some more suggestions I am happy to hear them.

That being said, after staring at my code for a few days I have to say I am definitely not a fan of the names at the moment (both HeuristicFrequencyRank and build_heuristic).

One idea I had is to just call it Heuristic. This would simplify and standardize all of the names, but is not very descriptive. However, I feel the argument can be made that memchr just uses the one heuristic, so if this heuristic is namespaced (memchr::memmem::Heuristic) then in theory it shouldn't be ambiguous.

Furthermore, since it is such a niche feature if someone intends to use it properly then they will likely need a much deeper understanding of what Heuristic does than can be gained from a more descriptive name. You mentioned that I'm the first person to ask for this in the many years this library has existed, and my intention (and I imagine the intention of others like me) is to put code that uses the Heuristic API in some library code that can be used without worrying about those specifics.

However I am not sure if I am 100% convinced by this argument, but I felt it was definitely worth mentioning. Some more ideas I've had for the names:

HeuristicFrequencyRank

  • HeuristicFrequency (there is a rank method)
  • ByteFrequencies/ByteFrequency
  • RareBytes (public counterpart to RareNeedleBytes?)
  • RarityHeuristic/HeuristicRarity (shorter, and memchr only deals with bytes anyway)
  • ByteHeuristic (more descriptive than Heuristic, similar reasoning)

build_heuristic is a bit harder I think.

  • build_heuristic is ambiguous (and does not necessarily imply forward)
  • build_forward_heuristic is still kind of ambiguous
  • build_forward_with_heuristic is very long
  • just build is ambiguous and would be confusing since this is not a 'common' use case
  • just heuristic or with_heuristic are inconsistent with build_forward/build_reverse

I also think the issues above would still apply if heuristic were replaced with another word.

Can you throw up what you have? I'll take a look. If the default type parameter path is indeed a dead end, then yeah, I think something like build_heuristic will do fine.

As for the naming... I'm not quite sure. I think the issue is using "heuristic" absent any other context. There are lots of heuristics that go into memmem here, so it isn't a particularly discriminating name. The "rank" concept is really the key ingredient for this option, so the word "rank" or a synonym should be involved somewhere here.

A slightly longer name is also acceptable to me personally since this is going to be a very infrequently used feature.

So I'm thinking things like... build_with_ranker or build_with_byte_ranker or something along those lines.

Yes I will upload what I have.

There are lots of heuristics that go into memmem here

Yes of course, I meant there is only one that the public can touch directly, but I agree completely, which is part of the reason I wasn't convinced either.

Just to say, build_with_ranker sounds good to me, especially if combined with a trait that has byte in the name, but I like both better than build_heuristic.