BurntSushi / suffix

Fast suffix arrays for Rust (with Unicode support).

Recommendations for dictionary of strings with wildcards?

rrichardson opened this issue · comments

I am implementing an MQTT Broker. After connection, a client subscribes to one or more topics, which might contain one of two wildcard types.
A regular topic might look like foo/bar/baz/blah.
A # wildcard matches the entirety of the remaining string, while a + wildcard matches only a single path segment, e.g. foo/bar/+/blah.
When a message is published to a topic, the broker would find all matching subscriptions and deliver that message accordingly.
This is a bit of a twist on the traditional string search problem, because the wildcards are in the dictionary instead of the query string.
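
For concreteness, here is a minimal sketch of those matching rules for a single subscription (the function name is mine; this is the brute-force baseline any cleverer structure has to beat):

```rust
// Segment-by-segment MQTT filter matching: `+` matches exactly one
// segment, `#` matches everything that remains.
fn topic_matches(filter: &str, topic: &str) -> bool {
    let mut f = filter.split('/');
    let mut t = topic.split('/');
    loop {
        match (f.next(), t.next()) {
            (Some("#"), _) => return true,                // rest of the topic
            (Some("+"), Some(_)) => continue,             // any single segment
            (Some(fs), Some(ts)) if fs == ts => continue, // literal match
            (None, None) => return true,                  // both exhausted
            _ => return false,
        }
    }
}

fn main() {
    assert!(topic_matches("foo/bar/+/blah", "foo/bar/baz/blah"));
    assert!(topic_matches("foo/#", "foo/bar/baz"));
    assert!(!topic_matches("foo/+", "foo/bar/baz"));
}
```

Scanning every subscription with something like this on each publish is linear in the number of subscriptions per message, which is exactly what I'd like to avoid.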

It is expected that the broker should be able to handle 10,000 connections, with each connection subscribing to an average of 20 topics. It is also expected that there would be many duplicate subscriptions across connections.

I can think of one approach, and I'm not thrilled by it, and that is to build the table using only the path segments themselves, and perform successive searches using the path segments of the query. Then I could use the resulting indices into the table as keys into a map of subscribers.

It would still be a bit tricky, because it would require an n-gram search for the resulting subscribers.

e.g.
subscribers:
A: is at pizza/+/meats/sausage
C: is at pizza/toppings/fruits/tomatoes
D: is at pizza/#

Naively**, the dictionary is

pizza/+/meats/sausage$pizza/toppings/fruits/tomatoes$pizza/#
0         1         2         3         4         5
012345678901234567890123456789012345678901234567890123456789

A message is published at pizza/toppings/fruits/tomatoes. Searching successively for its segments, the matches at 22 (pizza), 28 (toppings), 37 (fruits), and 44 (tomatoes) all fall within subscriber C's entry, so C matches. The match at 53 falls within pizza/#, where the next segment is the # wildcard, so D also matches. The match at 0 falls within A's entry, but A's later segments don't line up, so A is out.

** A big optimization here is that the dictionary doesn't need to duplicate path segments, so we could ensure they're unique. Given the nature of mqtt subscriptions, this would be a big win.
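
In code, the offset bookkeeping might look roughly like this, using this crate's SuffixTable (the byte ranges and subscriber labels are hard-coded purely for illustration):

```rust
use suffix::SuffixTable;

fn main() {
    let dict = "pizza/+/meats/sausage$pizza/toppings/fruits/tomatoes$pizza/#";
    // Byte range of each subscription entry within `dict`: A, C, D.
    let entries = [(0u32, 21, "A"), (22, 52, "C"), (53, 60, "D")];

    let table = SuffixTable::new(dict);
    // One search per segment of the published topic; each match offset
    // tells us which subscription entry it landed in.
    for seg in "pizza/toppings/fruits/tomatoes".split('/') {
        for &pos in table.positions(seg) {
            if let Some(&(_, _, sub)) = entries.iter().find(|&&(s, e, _)| s <= pos && pos < e) {
                println!("segment {:?} hits subscription {} at offset {}", seg, sub, pos);
            }
        }
    }
    // Deciding that C matched fully (all segments, in order) while A did
    // not is the tricky n-gram-ish part mentioned above.
}
```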

I'm thinking that might be too much effort trying to wedge the suffix table approach where it doesn't fit, and maybe I'm better off building a suffix tree where I can explicitly create and handle wildcard nodes. Also a bitmap index seems promising.

Even simpler, but perhaps less performant, would be to just build a massive Regex out of each subscription string (it might work for 100 subscriptions, but get unwieldy at 1000+, which is certainly in the realm of possibility).
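
The translation itself is mechanical; something like this sketch, mapping + to a single-segment match and # to an optional tail (regex::escape handles the literal segments):

```rust
// Turn an MQTT filter into an anchored regex pattern.
fn filter_to_regex(filter: &str) -> String {
    let mut pat = String::from("^");
    for (i, seg) in filter.split('/').enumerate() {
        if i > 0 {
            pat.push('/');
        }
        match seg {
            "+" => pat.push_str("[^/]+"), // exactly one segment
            "#" => {
                if i > 0 {
                    pat.pop(); // drop the '/' so foo/# also matches foo
                    pat.push_str("(/.*)?");
                } else {
                    pat.push_str(".*"); // a bare # matches every topic
                }
                break; // # is only legal as the last segment
            }
            _ => pat.push_str(&regex::escape(seg)),
        }
    }
    pat.push('$');
    pat
}
```

Each resulting pattern would then go into a RegexSet.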

p.s., I just found http://blog.burntsushi.net/transducers/ which I am attempting to digest.

I'm having a little trouble following this. Could you give more clear concrete examples of what the inputs and outputs are of the thing you want to build? (To be honest, it sounds like you know the space of possible solutions well enough that I don't think you'd need my help. :-) But I'm happy to try to understand more!)

The problem space is hierarchical path matching, similar to paths->routes in an HTTP server. A message comes in, published to a path (topic), and we need to find one or more subscribers for that path, where the paths themselves might have wildcards, either for a single segment (foo/.?/bar) or matching only the prefix of a topic (foo/.*).

The challenges are:

  1. The set of topics and subscribers changes fairly frequently at runtime.
  2. There could be multiple subscribers at each path.

Since I need to associate a list of subscribers with the matching paths, I'll need either a specialized tree, or a pair of lookups. I was thinking of using a RegexSet for the initial test, then a HashMap to map full topic strings to subscriber lists.
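
Since RegexSet::matches reports the indices of the patterns that matched, the second lookup could also just be a Vec kept parallel to the pattern list; a sketch (SubscriberId and the struct are made up):

```rust
use regex::RegexSet;

type SubscriberId = u64;

struct Subscriptions {
    set: RegexSet,
    subscribers: Vec<Vec<SubscriberId>>, // parallel to the patterns in `set`
}

impl Subscriptions {
    // `matches` yields the index of every pattern that matched, which
    // doubles as an index into the subscriber lists.
    fn matching<'a>(&'a self, topic: &str) -> impl Iterator<Item = SubscriberId> + 'a {
        self.set
            .matches(topic)
            .into_iter()
            .flat_map(move |i| self.subscribers[i].iter().cloned())
    }
}
```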

The other approach I was thinking of testing was building a suffix table using only unique segments, then making a reverse bitmap index into the lists of subscribers.
E.g. if somebody subscribes to foo/.?/bar, they'd go into the index as

foo | baz | bar | blah
 1  |  0  |  1  |  0
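
One concrete (if simplified) version of that idea: one bitmask per (segment position, segment value), plus a per-position mask for + and a mask per segment count, all ANDed together at match time. This omits # entirely, and a u64 obviously caps it at 64 subscriptions; a real version would use a proper bitmap type:

```rust
use std::collections::HashMap;

#[derive(Default)]
struct BitmapIndex {
    exact: HashMap<(usize, String), u64>, // (position, segment) -> subscription bits
    any: HashMap<usize, u64>,             // position -> subscriptions with `+` there
    lens: HashMap<usize, u64>,            // segment count -> subscription bits
}

impl BitmapIndex {
    fn insert(&mut self, id: u32, filter: &str) {
        let bit = 1u64 << id;
        let mut count = 0;
        for (i, seg) in filter.split('/').enumerate() {
            count = i + 1;
            match seg {
                "+" => *self.any.entry(i).or_insert(0) |= bit,
                _ => *self.exact.entry((i, seg.to_string())).or_insert(0) |= bit,
            }
        }
        *self.lens.entry(count).or_insert(0) |= bit;
    }

    fn matches(&self, topic: &str) -> u64 {
        let segs: Vec<&str> = topic.split('/').collect();
        // Start from subscriptions with the right segment count, then AND
        // in, per position, the subscriptions that match that segment
        // exactly or via `+`.
        let mut acc = self.lens.get(&segs.len()).copied().unwrap_or(0);
        for (i, seg) in segs.iter().enumerate() {
            let exact = self.exact.get(&(i, seg.to_string())).copied().unwrap_or(0);
            let any = self.any.get(&i).copied().unwrap_or(0);
            acc &= exact | any;
        }
        acc
    }
}
```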

The requirement for frequent mutability is the other thing leading me towards a custom suffix trie, one that would support wildcards at any node...
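
Strictly that would be a segment trie rather than a true suffix trie, but here is a sketch of what I mean by wildcard-aware nodes: + is stored as an ordinary child key, and # subscribers hang off the node where the # appears (this assumes published topics never contain wildcard characters, per the MQTT spec):

```rust
use std::collections::HashMap;

#[derive(Default)]
struct Node {
    children: HashMap<String, Node>, // literal segments, plus a "+" key
    subs: Vec<u64>,                  // subscriptions whose filter ends here
    hash_subs: Vec<u64>,             // subscriptions with `#` from here on
}

impl Node {
    fn subscribe(&mut self, filter: &str, id: u64) {
        let mut node = self;
        for seg in filter.split('/') {
            if seg == "#" {
                node.hash_subs.push(id);
                return;
            }
            node = node.children.entry(seg.to_string()).or_default();
        }
        node.subs.push(id);
    }

    fn collect(&self, segs: &[&str], out: &mut Vec<u64>) {
        out.extend_from_slice(&self.hash_subs); // `#` matches all of the rest
        match segs.split_first() {
            None => out.extend_from_slice(&self.subs),
            Some((seg, rest)) => {
                if let Some(child) = self.children.get(*seg) {
                    child.collect(rest, out);
                }
                if let Some(plus) = self.children.get("+") {
                    plus.collect(rest, out);
                }
            }
        }
    }
}

fn main() {
    let mut root = Node::default();
    root.subscribe("pizza/+/meats/sausage", 1);
    root.subscribe("pizza/#", 2);
    let segs: Vec<&str> = "pizza/toppings/fruits/tomatoes".split('/').collect();
    let mut out = Vec::new();
    root.collect(&segs, &mut out);
    assert_eq!(out, vec![2]); // only pizza/# matches
}
```

Inserts are cheap and local, which is what the mutability requirement wants; unsubscribing would additionally need to prune empty nodes.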

I created a test bench for topic matching here: https://github.com/rrichardson/topiclab/blob/master/src/main.rs

This allowed me to fairly confidently rule out RegexSet as a candidate:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: CompiledTooBig(10485760)', /checkout/src/libcore/result.rs:906:4

I'll try the suffix array + hashset + bitmap next and see how that fares. If that goes poorly I might have to bite the bullet and write the suffix trie.

This allowed me to fairly confidently rule out RegexSet as a candidate:

The compile size limit isn't hard-coded. You can configure it: https://docs.rs/regex/0.2.2/regex/struct.RegexSetBuilder.html#method.size_limit

You will probably also want to configure the DFA size limit, but it's a bit more magical, since a limit of 0 is legal and the regex engine will just seamlessly switch to the slower engine: https://docs.rs/regex/0.2.2/regex/struct.RegexSetBuilder.html#method.dfa_size_limit
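
Something like this (regex 0.2; the numbers are arbitrary):

```rust
use regex::{Error, RegexSet, RegexSetBuilder};

fn build_set(patterns: &[String]) -> Result<RegexSet, Error> {
    RegexSetBuilder::new(patterns)
        .size_limit(100 * (1 << 20))     // compiled program limit, in bytes
        .dfa_size_limit(100 * (1 << 20)) // lazy DFA cache limit, in bytes
        .build()
}
```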

frequent mutability

I still haven't had a chance to review your comments in enough detail to understand them, but in my experience, mutability with these sorts of things is usually dealt with by permitting multiple instances of a structure and then periodically merging them. So for example, you might have many regex sets, and then once some threshold is crossed, you might re-compile them all into one and then repeat. Lucene does a similar thing with its segment merging.
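
In sketch form (all names are mine; the threshold is arbitrary and errors are unwrapped for brevity):

```rust
use regex::RegexSet;

struct TieredSet {
    compiled: RegexSet,    // big set, rebuilt only on merge
    patterns: Vec<String>, // source patterns behind `compiled`
    recent: Vec<String>,   // small overflow tier, matched naively
}

impl TieredSet {
    fn add(&mut self, pattern: String) {
        self.recent.push(pattern);
        if self.recent.len() >= 256 {
            // Threshold crossed: fold the small tier into the big one
            // and recompile, like a segment merge.
            self.patterns.append(&mut self.recent);
            self.compiled = RegexSet::new(&self.patterns).unwrap();
        }
    }
}
```

Matching would consult `compiled` plus a linear scan of `recent` (or a second, tiny RegexSet rebuilt on every add).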

Thanks for that pointer. I'll give it a try.
As far as a concise summary of requirements goes, it would be this:

  1. I need a dictionary of hierarchical search paths, very similar to REST-style paths, including a wildcard for a single segment of a path and a different wildcard to indicate that only the prefix up to that wildcard character needs to match, e.g. foo/bar/.+/baz/.* (although the mqtt protocol specifies different characters than regex does).
  2. There could be as many as 10,000 such search paths in the dictionary.
  3. It could change fairly frequently; let's assume it changes every couple of seconds.
  4. It is expected that multiple threads would be matching against the dictionary a couple thousand times per second (see the sketch after this list).
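
For 3 and 4, the shape I have in mind is readers grabbing a cheap snapshot of an immutable index while a writer rebuilds and swaps it in the background; a rough sketch:

```rust
use std::sync::{Arc, RwLock};

// Stand-in for whichever structure wins the bench (RegexSet, trie, ...).
struct Index { /* ... */ }

struct Shared {
    current: RwLock<Arc<Index>>,
}

impl Shared {
    // Readers hold the lock just long enough to clone the Arc, then
    // match against an immutable snapshot with no further locking.
    fn snapshot(&self) -> Arc<Index> {
        self.current.read().unwrap().clone()
    }

    // The updater builds a fresh Index off to the side (every couple of
    // seconds, per requirement 3) and swaps it in.
    fn publish(&self, next: Index) {
        *self.current.write().unwrap() = Arc::new(next);
    }
}
```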

There could be as many as 10,000 such search paths in the dictionary.

My feeling is that this is probably the upper end of what I'd feel comfortable putting into a RegexSet. It's possible it's even beyond the upper end. I'm not really sure. But you may need to bump the DFA size limit up quite a bit, keeping in mind that the DFA size limit roughly corresponds to the amount of heap that the regex engine will use while matching, per thread.

It is possible that a lower tech solution will work for you. For example, does every path in your dictionary start with a literal prefix? And if so, what is the distribution of said prefixes? You might consider a two pass search, where you first narrow down your candidates based on which prefix matched. And then assuming you have a structure that lists the remaining paths corresponding to that prefix, you could proceed with a potentially more naive search at that point. Just spitballing though.
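
As a sketch of what I mean (all names are mine; the catch-all bucket covers dictionary items whose first segment is itself a wildcard):

```rust
use std::collections::HashMap;

struct PrefixGrouped {
    by_head: HashMap<String, Vec<String>>, // first literal segment -> filters
    wildcard_head: Vec<String>,            // filters starting with + or #
}

impl PrefixGrouped {
    // Pass 1: pick the bucket by the topic's first segment. Pass 2 (not
    // shown) runs a naive per-filter match over just these candidates.
    fn candidates<'a>(&'a self, topic: &str) -> impl Iterator<Item = &'a String> {
        let head = topic.split('/').next().unwrap_or("");
        self.by_head
            .get(head)
            .into_iter()
            .flatten()
            .chain(self.wildcard_head.iter())
    }
}
```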

For example, does every path in your dictionary start with a literal prefix?

There are other variants of this question, such as, "does every dictionary item contain any literals at all?" and if so, you could run with that too.

The huge advantage is definitely that the prefixes will all be very similar, and the set of all unique segment values will be small as well. I am making a general purpose broker, so I can't speak for all users, but I expect that the distribution of paths would roughly reflect REST, in that you'd see a lot of /namespace/subject/verb/object/<id> (where id is high entropy) or /namespace/subject/<id>/verb/object/<id>.

Ah, yeah, "general purpose" makes it a trickier problem, but I think you get the gist of my idea. I'd try exploiting some structure in the items in your dictionary.

Upping the size_limit and dfa_size_limit allowed me to build and test. The very naive approach demonstrated in the current topiclab takes 230ms to build a set of 10k topic streams, and an average of 0.6ms to match a similarly sized string.
Also I set the dfa size to 1GB. I'm not sure if that is a static initial size or if it grows to that, but my bencher was using 1.6GB of ram for the test :)

0.6ms for a multi-match was better than I was expecting. I am hoping for at least an order of magnitude improvement in match speed and space with some obvious optimizations that you pointed out.

Sounds great!

Also I set the dfa size to 1GB. I'm not sure if that is a static initial size or if it grows to that, but my bencher was using 1.6GB of ram for the test :)

It is not an initial size. That's just the amount of memory that the DFA will be allowed to use to create states. If it exceeds that limit, then it will divert to the slower engine.

I'd definitely be curious to see if that 1.6GB was, in a significant way, attributable to the regex engine. That seems like a lot. :-)