scoder / acora

Fast multi-keyword search engine for text strings

Home Page: http://pypi.python.org/pypi/acora

The construction algorithm does not scale well.

aolieman opened this issue

It is noted in the documentation that the construction algorithm is not suitable for more than a couple of thousand keywords. In my experience, the memory requirements go through the roof at around 10K keywords.

I am, however, quite motivated to (somehow) use Aho-Corasick search in python with > 100K keywords. Several workarounds are possible for my current use-case, but if it is possible to improve the scalability of Acora, I'd much rather spend my time on contributing to that.

So, where do we start?
It seems that tree construction is possible in linear runtime: http://link.springer.com/chapter/10.1007%2F11496656_15

Yes, it's sad that the construction is such a serious bottleneck. The problem is that it a) unfolds the tree from a nondeterministic automaton into a deterministic one using powerset construction, meaning that it suffers from state explosion, and b) does it in Python dicts, which are not a memory-efficient representation of a tree. There are certainly tree constructions available that perform substantially better and produce more efficient trees. The paper you referenced seems to provide such a way. There's a pre-print version here, BTW:
http://cs.haifa.ac.il/~landau/gadi/shiri.pdf

The tree construction in acora is done entirely in Python, though. It's in _nfa2dfa.py. If you can get a Python implementation of a better tree construction running, it should be possible to integrate it with the search engine without major changes.

Thanks for the explanation! Good news that the tree construction is Python-only; that gives me a much better chance at providing an alternative than if you had coded it in C.

I guess I have to do some homework on the implementation specifics in the paper I referenced. I'm pretty sure the NFA->DFA transform is the bigger problem of the two you mentioned. Using dictionaries may not be memory-efficient for this purpose, but that alone wouldn't cause the exponential memory growth that we're seeing now.

Using the method described in the paper should be possible with a Python-only trie, but if we're going for scalability it is probably good to use e.g. marisa-trie. How would you feel about such a dependency? The criteria that matter most to me are covered: it's pip-installable, supports Unicode, and can be pickled.
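For reference, a quick sketch of the features I care about (this is just marisa-trie's own API, not a proposal for how acora would use it):

```python
# Sketch of the marisa-trie features mentioned above: compact unicode
# keys, prefix lookup, and pickling.
import pickle
import marisa_trie

trie = marisa_trie.Trie([u"foo", u"foobar", u"bar"])

assert u"foobar" in trie
assert set(trie.prefixes(u"foobarbaz")) == {u"foo", u"foobar"}

restored = pickle.loads(pickle.dumps(trie))   # survives a pickle round-trip
assert set(restored.keys()) == {u"foo", u"foobar", u"bar"}
```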

By the way, the current workaround I'm using is to split my set of keywords into subsets of at most 10K and to construct a separate tree for each. Then I apply a very basic form of federated search to get results from each tree. It works okay, but selecting the longest matches is a bit of a hassle.
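In case it helps, a rough sketch of that workaround. Only the `AcoraBuilder`/`add`/`build`/`findall` calls are acora's actual API; the helper functions are mine, and keeping just the longest match per start position is only a crude approximation of proper overlapping-match selection:

```python
from itertools import groupby
from operator import itemgetter

from acora import AcoraBuilder

def build_engines(keywords, chunk_size=10000):
    # One Acora engine per chunk of at most chunk_size keywords.
    engines = []
    for i in range(0, len(keywords), chunk_size):
        builder = AcoraBuilder()
        builder.add(*keywords[i:i + chunk_size])
        engines.append(builder.build())
    return engines

def federated_findall(engines, text):
    # Merge the (keyword, position) matches from all engines, then keep
    # only the longest keyword among those starting at the same position.
    matches = []
    for engine in engines:
        matches.extend(engine.findall(text))
    matches.sort(key=lambda kw_pos: (kw_pos[1], -len(kw_pos[0])))
    return [next(group) for _, group in groupby(matches, key=itemgetter(1))]
```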

Adding a dependency on marisa-trie might be premature optimisation. And it seems it doesn't support byte strings out of the box, so that would have to be faked with Latin-1 decoding/encoding. I'd say, let's see how far we can get without it.
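The Latin-1 trick works because the 256 byte values map 1:1 to the first 256 code points, so byte strings round-trip losslessly through text keys:

```python
# Fake byte-string keys as text via Latin-1 (bijective for all 256 bytes).
byte_keys = [b"\xffbar", b"foo"]
text_keys = [b.decode("latin-1") for b in byte_keys]   # store these
assert [s.encode("latin-1") for s in text_keys] == byte_keys
```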

Splitting the construction is only a workaround. If algorithmic improvements can be implemented, they are clearly preferable.

Of course, only an algorithmic improvement would be sensible to include in Acora. I just mentioned my workaround for searchers who stumble across this issue report in the meantime. Once I extend the workaround with satisfactory longest-match selection for overlapping matches, it could be added as a recipe in the docs, though.

Given my current schedule, I'll only get around to giving the alternative tree construction algorithm a serious go after the summer.

I guess you're also right that a dependency on marisa-trie would be premature optimisation. I might as well try implementing the algorithm described in the paper using only basic Python datatypes, and then see where we go from there based on (memory) profiling.

@scoder @aolieman Reading the algorithm from @shirki, I am under the impression that there is a big hidden constant in the linear construction.
See also http://www.cs.ucr.edu/~stelo/cpm/cpm05/cpm05_4_5_Dori.ppt

You have to:

  1. sort inputs
  2. construct the trie
  3. construct a generalized suffix tree on the reversed inputs
  4. compute the proper ancestors on that suffix tree to derive the failure links.
  5. create the failure links in the trie from step 2 using the results of step 4, by mapping suffix-tree nodes back to trie nodes.

@shirki: am I missing something? Would you have an example of code by chance?
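To make the first two steps concrete, here is a sketch in plain Python (steps 3-5 need a generalized suffix tree, which is exactly the part I'd like to see code for):

```python
# Sketch of steps 1-2 only: sort the inputs, then build a nested-dict trie.
# The terminal marker '' is an arbitrary choice for this sketch.
def build_sorted_trie(keywords):
    keywords = sorted(keywords)          # step 1: sort inputs
    trie = {}                            # step 2: nested-dict trie
    for word in keywords:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[''] = word                  # mark the node as a keyword end
    return trie
```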

Note that this code is currently being rewritten. I'll soon have a correct implementation available for testing.

@scoder Thanks! ... Unrelated... Do you think that an id could be added to each string stored in the automaton?

@scoder that's wonderful! Looking forward to your implementation.

@pombredanne regarding the addition of IDs: what would be the use-case? Given that your keywords are unique, I see no problem mapping them to IDs external to the automaton (e.g. using a dict or database).

@aolieman This all depends on your definition of what a keyword is. It can be an arbitrarily long string, and keywords do not have to be unique either. For long strings, and many of them, storing an external dict with the string as key is a terribly bad solution: at best you end up storing the strings twice, at worst you can exhaust RAM.

@aolieman ... which is why most trie-like structures allow you to store an id (or more) in a dict-like fashion: pyahocorasick, esmre, datrie, dawg, and marisa-trie all take this approach.
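For example, with pyahocorasick you can attach an (id, keyword) payload to each entry, so no external dict is needed. A minimal sketch:

```python
# Each keyword carries an (id, keyword) payload inside the automaton.
import ahocorasick

automaton = ahocorasick.Automaton()
for idx, key in enumerate(["he", "she", "hers"]):
    automaton.add_word(key, (idx, key))
automaton.make_automaton()

# iter() yields (end_index, payload) for every match in the haystack.
for end_index, (idx, key) in automaton.iter("ushers"):
    start = end_index - len(key) + 1
    print(idx, key, start)
```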

@pombredanne I see that you've opened #8 to discuss this issue. It's a legitimate request of course, although I wonder what you mean by non-unique keywords.

@aolieman You say keywords, but I think of arbitrary strings. Say I want to index the text of 1000 web pages, each treated as a "keyword" of possibly 1000 words. I may not know ahead of time whether, by coincidence, two of these strings are duplicates.

@aolieman Another case where ids matter would be if you store short, bona-fide "keywords" in multiple languages, say English and Dutch, in the same trie. There may be two words that are spelled the same in both languages, and knowing whether a word is defined in one or both languages matters. I would not want to have to "dedupe" each language ahead of time to pick one unique keyword. There again, it does not make sense to me space-wise to store the words both in the trie and in another dictionary. If anything, part of a trie's appeal is compression.

I uploaded the rewrite in a new branch: https://github.com/scoder/acora/tree/rewrite
It's clearly better than the old implementation, so please give it a try. I'm considering removing some of the backwards-compatibility code that now gets somewhat in the way. I can't say yet whether that would improve things further, although I would hope so. It might allow for further tuning of the initial tree construction.

@scoder Thanks! Can you explain in a nutshell how the new implementation works?

The main difference is that it avoids state explosion by not doing the NFA-to-DFA transformation anymore. Instead, it correctly implements Aho-Corasick and eliminates multi-step fallbacks by merging them into the edges of the final graph. I also implemented a graphviz dot dump that helps visualise the generated tree, so you can take a look at what you get.
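For the curious, here is a minimal pure-Python sketch of that idea (not acora's actual code): build a plain trie, compute failure links breadth-first, and merge each state's fallback edges into its own edge dict, so a search step is a single dict lookup that never chains through failure links:

```python
from collections import deque

def build_automaton(keywords):
    goto = [{}]        # per-state edge dicts: state -> {char: next_state}
    out = [set()]      # keywords ending in each state
    for word in keywords:                     # 1) plain trie construction
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto[state][ch] = len(goto)
                goto.append({})
                out.append(set())
            state = goto[state][ch]
        out[state].add(word)

    fail = [0] * len(goto)                    # 2) failure links, BFS order
    queue = deque(goto[0].values())           # depth-1 states fall back to root
    while queue:
        state = queue.popleft()
        out[state] |= out[fail[state]]        # inherit matches from the fallback
        for ch, child in list(goto[state].items()):
            queue.append(child)
            fail[child] = goto[fail[state]].get(ch, 0)
        # 3) merge the fallback's (already merged) edges into this state,
        #    eliminating multi-step fallbacks at search time
        for ch, target in goto[fail[state]].items():
            goto[state].setdefault(ch, target)
    return goto, out

def findall(goto, out, text):
    state, matches = 0, []
    for i, ch in enumerate(text):
        state = goto[state].get(ch, 0)        # missing edge: back to the root
        matches.extend((word, i - len(word) + 1) for word in out[state])
    return matches

goto, out = build_automaton(["he", "she", "hers"])
print(findall(goto, out, "ushers"))  # ('she', 1), ('he', 2), ('hers', 2), in some order
```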

Any comments on the new implementation?

Despite my excitement, I haven't had a chance to try the new implementation. I'll likely find the time next week.

@scoder the improvements are amazing! Sorry that it took me so long to find this out.

The Cython implementation now performs excellently in terms of both build time and memory usage. I ran some tests with a set of 200K keywords that failed to build with the previous implementation. On my laptop the tree builds in 10-20 s, using ~600 MB of RAM.

Just for kicks, I also tested the pure Python implementation. Memory usage is still an issue there, but I was positively surprised to find that the tree can be built at all. In my case it takes about 90 s and ~7 GB of RAM. There was, however, no discernible difference in search runtime compared with Cython (~1 s to find 1000 matches in the same number of sentences). This is not what I expected, given the docs, earlier benchmarks, and common sense. (Edit: I'm not certain that the search actually used pure Python after all.)
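For anyone who wants to reproduce this kind of measurement, here is roughly what my test looked like (synthetic keywords here, not my real corpus; the resource module is Unix-only):

```python
# Rough build benchmark with 200K synthetic keywords.
# resource.ru_maxrss reports peak RSS in KiB on Linux.
import random
import resource
import string
import time

from acora import AcoraBuilder

random.seed(42)
keywords = set()
while len(keywords) < 200000:
    keywords.add(''.join(random.choice(string.ascii_lowercase)
                         for _ in range(random.randint(4, 12))))

start = time.time()
builder = AcoraBuilder()
builder.add(*keywords)
ac = builder.build()
print("build took %.1fs, peak RSS %d MiB" % (
    time.time() - start,
    resource.getrusage(resource.RUSAGE_SELF).ru_maxrss // 1024))
```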

In conclusion: thank you very much @scoder! And I think the rewrite deserves to be on PyPI.

This issue can be closed. @scoder would you mind doing a new release with the improvements?

Released as acora 2.0.