scoder / acora

Fast multi-keyword search engine for text strings

Home Page: http://pypi.python.org/pypi/acora

The construction algorithm does not scale well.

aolieman opened this issue

It is noted in the documentation that the construction algorithm is not suitable for more than a couple of thousand keywords. In my experience, the memory requirements go through the roof at around 10K keywords.

I am, however, quite motivated to (somehow) use Aho-Corasick search in python with > 100K keywords. Several workarounds are possible for my current use-case, but if it is possible to improve the scalability of Acora, I'd much rather spend my time on contributing to that.

So, where do we start?
It seems that tree construction is possible in linear runtime: http://link.springer.com/chapter/10.1007%2F11496656_15

Yes, it's sad that the construction is such a serious bottleneck. The problem is that it a) unfolds the tree from a nondeterministic automaton into a deterministic one using powerset construction, meaning that it suffers from state explosion, and b) does it in Python dicts, which are not a memory-efficient representation of a tree. There are certainly tree constructions available that perform substantially better and produce more efficient trees. The paper you referenced seems to provide such a way. There's a pre-print version here, BTW:
http://cs.haifa.ac.il/~landau/gadi/shiri.pdf

The tree construction in acora is done entirely in Python, though. It's in _nfa2dfa.py. If you can get a Python implementation of a better tree construction running, it should be possible to integrate it with the search engine without major changes.

Thanks for the explanation! Good news that the tree construction is Python-only; that gives me a much better chance at providing an alternative than if you had coded it in C.

I guess I have to do some homework on the implementation specifics in the paper I referenced. I'm pretty sure the NFA->DFA transform is the bigger problem of the two you mentioned. Using dictionaries may not be memory-efficient for this purpose, but that alone wouldn't cause the exponential memory growth that we're seeing now.

Using the method described in the paper should be possible with a Python-only trie, but if we're going for scalability it is probably good to use e.g. marisa-trie. How would you feel about such a dependency? The criteria that matter most to me are covered: it's pip-installable, supports Unicode, and can be pickled.
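For reference, a quick sketch of the features I care about (this is just marisa-trie's own API, not a proposal for how acora would use it):

```python
# Sketch of the marisa-trie features mentioned above: compact unicode
# keys, prefix lookup, and pickling.
import pickle
import marisa_trie

trie = marisa_trie.Trie([u"foo", u"foobar", u"bar"])

assert u"foobar" in trie
assert set(trie.prefixes(u"foobarbaz")) == {u"foo", u"foobar"}

restored = pickle.loads(pickle.dumps(trie))   # survives a pickle round-trip
assert set(restored.keys()) == {u"foo", u"foobar", u"bar"}
```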

By the way, the current workaround I'm using is to split my set of keywords into subsets of at most 10K and to construct a separate tree for each. Then I apply a very basic form of federated search to get results from each tree. It works okay, but selecting the longest matches is a bit of a hassle.
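In case it helps, a rough sketch of that workaround. Only the `AcoraBuilder`/`add`/`build`/`findall` calls are acora's actual API; the helper functions are mine, and keeping just the longest match per start position is only a crude approximation of proper overlapping-match selection:

```python
from itertools import groupby
from operator import itemgetter

from acora import AcoraBuilder

def build_engines(keywords, chunk_size=10000):
    # One Acora engine per chunk of at most chunk_size keywords.
    engines = []
    for i in range(0, len(keywords), chunk_size):
        builder = AcoraBuilder()
        builder.add(*keywords[i:i + chunk_size])
        engines.append(builder.build())
    return engines

def federated_findall(engines, text):
    # Merge the (keyword, position) matches from all engines, then keep
    # only the longest keyword among those starting at the same position.
    matches = []
    for engine in engines:
        matches.extend(engine.findall(text))
    matches.sort(key=lambda kw_pos: (kw_pos[1], -len(kw_pos[0])))
    return [next(group) for _, group in groupby(matches, key=itemgetter(1))]
```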

Adding a dependency on marisa-trie might be premature optimisation. And it seems it doesn't support byte strings out of the box, so that would have to be faked with Latin-1 decoding/encoding. I'd say, let's see how far we can get without it.
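The Latin-1 trick works because the 256 byte values map 1:1 to the first 256 code points, so byte strings round-trip losslessly through text keys:

```python
# Fake byte-string keys as text via Latin-1 (bijective for all 256 bytes).
byte_keys = [b"\xffbar", b"foo"]
text_keys = [b.decode("latin-1") for b in byte_keys]   # store these
assert [s.encode("latin-1") for s in text_keys] == byte_keys
```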

Splitting the construction is only a workaround. If algorithmic improvements can be implemented, they are clearly preferable.

Of course, only an algorithmic improvement would be sensible to include in Acora. I just mentioned my workaround for searchers who stumble across this issue report in the meantime. Once I extend the workaround with satisfactory longest-match selection for overlapping matches, it could be added as a recipe in the docs, though.

Given my current schedule, I'll only get around to giving the alternative tree construction algorithm a serious go after the summer.

I guess you're also right that a dependency on marisa-trie would be premature optimisation. I might as well try implementing the algorithm described in the paper using only basic Python datatypes, and then see where we go from there based on (memory) profiling.

@scoder @aolieman Reading the algorithm from @shirki, I am under the impression that there is a big hidden constant in the linear construction.
See also http://www.cs.ucr.edu/~stelo/cpm/cpm05/cpm05_4_5_Dori.ppt

You have to:

  1. sort inputs
  2. construct the trie
  3. construct a generalized suffix tree on the reversed inputs
  4. compute the proper ancestors on that suffix tree to derive the failure links.
  5. create the failure links in the trie from step 2 using the results of step 4, by mapping suffix-tree nodes back to trie nodes.

@shirki: am I missing something? Would you have an example of code by chance?
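To make the first two steps concrete, here is a sketch in plain Python (steps 3-5 need a generalized suffix tree, which is exactly the part I'd like to see code for):

```python
# Sketch of steps 1-2 only: sort the inputs, then build a nested-dict trie.
# The terminal marker '' is an arbitrary choice for this sketch.
def build_sorted_trie(keywords):
    keywords = sorted(keywords)          # step 1: sort inputs
    trie = {}                            # step 2: nested-dict trie
    for word in keywords:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[''] = word                  # mark the node as a keyword end
    return trie
```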

Note that this code is currently being rewritten. I'll soon have a correct implementation available for testing.

@scoder Thanks! ... Unrelated... Do you think that an id could be added to each string stored in the automaton?

@scoder that's wonderful! Looking forward to your implementation.

@pombredanne regarding the addition of IDs: what would be the use-case? Given that your keywords are unique, I see no problem mapping them to IDs external to the automaton (e.g. using a dict or database).

@aolieman This all depends on your definition of what a keyword is. It can be an arbitrarily long string, and keywords do not have to be unique either. For long strings, and many of them, storing an external dict with the string as key is a terribly bad solution: at best you end up storing the strings twice, at worst you can exhaust RAM.

@aolieman ... which is why most trie-like structures allow you to store an id (or more) in a dict-like fashion: pyahocorasick, esmre, datrie, dawg, and marisa-trie all take this approach.
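For example, with pyahocorasick you can attach an (id, keyword) payload to each entry, so no external dict is needed. A minimal sketch:

```python
# Each keyword carries an (id, keyword) payload inside the automaton.
import ahocorasick

automaton = ahocorasick.Automaton()
for idx, key in enumerate(["he", "she", "hers"]):
    automaton.add_word(key, (idx, key))
automaton.make_automaton()

# iter() yields (end_index, payload) for every match in the haystack.
for end_index, (idx, key) in automaton.iter("ushers"):
    start = end_index - len(key) + 1
    print(idx, key, start)
```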

@pombredanne I see that you've opened #8 to discuss this issue. It's a legitimate request of course, although I wonder what you mean by non-unique keywords.

@aolieman You say keywords, but I think of arbitrary strings. Say I want to index the text of 1000 web pages, each treated as a "keyword" of possibly 1000 words. I may not know ahead of time whether, by coincidence, two of these strings are duplicates.

@aolieman Another case where ids matter would be if you store short, bona-fide "keywords" in multiple languages, say English and Dutch, in the same trie. There may be two words that are spelled the same in both languages, and knowing whether a word is defined in one or both languages matters. I would not want to have to "dedupe" each language ahead of time to pick one unique keyword. There again, it does not make sense to me space-wise to store the words both in the trie and in another dictionary. If anything, part of a trie's appeal is compression.

I uploaded the rewrite in a new branch: https://github.com/scoder/acora/tree/rewrite
It's clearly better than the old implementation, so please give it a try. I'm considering removing some of the backwards-compatibility code that now gets somewhat in the way. I can't say yet whether that would improve things further, although I would hope so. It might allow for further tuning of the initial tree construction.

@scoder Thanks! Can you explain in a nutshell how the new implementation works?

The main difference is that it avoids state explosion by not doing the NFA-to-DFA transformation anymore. Instead, it correctly implements Aho-Corasick and eliminates multi-step fallbacks by merging them into the edges of the final graph. I also implemented a graphviz dot dump that helps visualise the generated tree, so you can take a look at what you get.
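For the curious, here is a minimal pure-Python sketch of that idea (not acora's actual code): build a plain trie, compute failure links breadth-first, and merge each state's fallback edges into its own edge dict, so a search step is a single dict lookup that never chains through failure links:

```python
from collections import deque

def build_automaton(keywords):
    goto = [{}]        # per-state edge dicts: state -> {char: next_state}
    out = [set()]      # keywords ending in each state
    for word in keywords:                     # 1) plain trie construction
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto[state][ch] = len(goto)
                goto.append({})
                out.append(set())
            state = goto[state][ch]
        out[state].add(word)

    fail = [0] * len(goto)                    # 2) failure links, BFS order
    queue = deque(goto[0].values())           # depth-1 states fall back to root
    while queue:
        state = queue.popleft()
        out[state] |= out[fail[state]]        # inherit matches from the fallback
        for ch, child in list(goto[state].items()):
            queue.append(child)
            fail[child] = goto[fail[state]].get(ch, 0)
        # 3) merge the fallback's (already merged) edges into this state,
        #    eliminating multi-step fallbacks at search time
        for ch, target in goto[fail[state]].items():
            goto[state].setdefault(ch, target)
    return goto, out

def findall(goto, out, text):
    state, matches = 0, []
    for i, ch in enumerate(text):
        state = goto[state].get(ch, 0)        # missing edge: back to the root
        matches.extend((word, i - len(word) + 1) for word in out[state])
    return matches

goto, out = build_automaton(["he", "she", "hers"])
print(findall(goto, out, "ushers"))  # ('she', 1), ('he', 2), ('hers', 2), in some order
```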

Any comments on the new implementation?

Despite my excitement, I haven't had a chance to try the new implementation. I'll likely find the time next week.

@scoder the improvements are amazing! Sorry that it took me so long to find this out.

The Cython implementation now performs excellently in terms of both build time and memory usage. I ran some tests with a set of 200K keywords that failed to build with the previous implementation. On my laptop the tree builds in 10-20 s, using ~600 MB of RAM.

Just for kicks, I also tested the pure Python implementation. Memory usage is still an issue there, but I was positively surprised to find that the tree can be built at all. In my case it takes about 90 s and ~7 GB of RAM. There was, however, no discernible difference in search runtime compared with Cython (~1 s to find 1000 matches in the same number of sentences). This is not what I expected, given the docs, earlier benchmarks, and common sense. (Edit: I'm not certain that the search actually used pure Python after all.)
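For anyone who wants to reproduce this kind of measurement, here is roughly what my test looked like (synthetic keywords here, not my real corpus; the resource module is Unix-only):

```python
# Rough build benchmark with 200K synthetic keywords.
# resource.ru_maxrss reports peak RSS in KiB on Linux.
import random
import resource
import string
import time

from acora import AcoraBuilder

random.seed(42)
keywords = set()
while len(keywords) < 200000:
    keywords.add(''.join(random.choice(string.ascii_lowercase)
                         for _ in range(random.randint(4, 12))))

start = time.time()
builder = AcoraBuilder()
builder.add(*keywords)
ac = builder.build()
print("build took %.1fs, peak RSS %d MiB" % (
    time.time() - start,
    resource.getrusage(resource.RUSAGE_SELF).ru_maxrss // 1024))
```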

In conclusion: thank you very much @scoder! And I think the rewrite deserves to be on PyPI.

This issue can be closed. @scoder would you mind doing a new release with the improvements?

Released as acora 2.0.