pyahocorasick is a Python module that implements two kinds of data structures: a trie and an Aho-Corasick string matching automaton.
A trie is a dictionary indexed by strings, which allows retrieving the associated items in time proportional to the string length. An Aho-Corasick automaton allows finding all occurrences of strings from a given set in a single pass over a text.
(An Aho-Corasick automaton is built on top of a trie, so a trie has to be created first; this is why these two distinct entities live in a single module.)
There are two versions:
- C extension, compatible only with Python 3;
- pure Python module, compatible with Python 2 and 3.
The API of the pure Python module is similar to, but not exactly the same as, that of the C extension.
The library is licensed under the very liberal two-clause BSD license. Some portions have been released into the public domain.
The full text of the license is available in the LICENSE file.
Wojciech Muła, wojciech_mula@poczta.onet.pl
- Module ahocorasick by Danny Yoo --- seems unmaintained (last update in 2005) and is licensed under GPL.
- Article about different trie representations --- the result of experiments I made while working on this module.
Just run:
python setup.py install
If compilation succeeds, the module is ready to use.
Module ahocorasick contains several constants and the class Automaton.
The type of strings accepted and returned by Automaton methods can be either unicode or bytes, depending on compile-time settings (preprocessor definition AHOCORASICK_UNICODE). The value of the module member unicode indicates the chosen type.
Warning
If unicode is selected, the trie stores 2 or even 4 bytes per letter, depending on Python settings. If bytes are selected, just one byte per letter is needed.
- unicode --- see Unicode and bytes
- STORE_ANY, STORE_INTS, STORE_LENGTH --- see Constructor
- EMPTY, TRIE, AHOCORASICK --- see Members
- MATCH_EXACT_LENGTH, MATCH_AT_MOST_PREFIX, MATCH_AT_LEAST_PREFIX --- see description of method keys
The Automaton class is picklable (implements __reduce__()).
kind [readonly]
One of the values:
EMPTY
- There are no words saved in the automaton.
TRIE
- There are some words, but methods related to the Aho-Corasick algorithm (find_all, iter) won't work.
AHOCORASICK
- The Aho-Corasick automaton has been constructed; full functionality is available.
The kind is maintained internally by the Automaton object. Some methods are not available when the automaton kind is EMPTY or isn't AHOCORASICK. Calling them in that state raises an exception, but testing this property first may be better (faster, more elegant).
store [readonly]
- Type of values stored in the trie. By default STORE_ANY is used, so any Python object can be stored. When STORE_INTS or STORE_LENGTH is used, values are 32-bit integers and do not occupy additional memory. See the add_word description for details.
The constructor accepts just one argument, the type of stored values, which is one of the constants:
STORE_ANY
- Any Python object (default).
STORE_LENGTH
- Length of string.
STORE_INTS
- 32-bit integers.
get(word[, default])
- Returns the value associated with word. If word isn't present in the dictionary, returns default when it is given and raises KeyError otherwise.
keys([prefix, [wildchar, [how]]]) => yield strings
Returns an iterator that iterates through words.
If prefix (a string) is given, then only words sharing this prefix are yielded.
If wildchar (a single character) is given, then prefix is treated as a simple pattern with the selected wildchar. The optional parameter how controls which strings are matched:
MATCH_EXACT_LENGTH [default]
- Only strings with the same length as the pattern are yielded. In other words, the pattern is matched literally.
MATCH_AT_LEAST_PREFIX
- Strings whose length is greater than or equal to the pattern's length are yielded.
MATCH_AT_MOST_PREFIX
- Strings whose length is less than or equal to the pattern's length are yielded.
See Example 2.
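The matching rules of keys can be mimicked in plain Python over an ordinary list of words. This is only a sketch of the semantics as described here, not the module's implementation. The function name keys_sketch and the string stand-ins for the MATCH_* constants are made up for illustration, and two points are assumptions: that only the overlapping part of pattern and word is compared, and that results come back sorted.

```python
# Hypothetical stand-ins for the module's MATCH_* constants.
MATCH_EXACT_LENGTH = "exact"
MATCH_AT_LEAST_PREFIX = "at_least"
MATCH_AT_MOST_PREFIX = "at_most"


def keys_sketch(words, pattern="", wildchar=None, how=MATCH_EXACT_LENGTH):
    """Return the words selected by pattern (assumed semantics, sorted)."""

    def chars_agree(word):
        # zip() stops at the shorter string, so only the overlap is compared.
        return all(p == wildchar or p == c for p, c in zip(pattern, word))

    if wildchar is None:
        # Plain prefix query: no wildcard handling at all.
        return sorted(w for w in words if w.startswith(pattern))

    if how == MATCH_EXACT_LENGTH:
        def length_ok(word):
            return len(word) == len(pattern)
    elif how == MATCH_AT_LEAST_PREFIX:
        def length_ok(word):
            return len(word) >= len(pattern)
    else:
        def length_ok(word):
            return len(word) <= len(pattern)

    return sorted(w for w in words if length_ok(w) and chars_agree(w))


words = ["cat", "rat", "rate", "bat"]
print(keys_sketch(words, "ra"))
print(keys_sketch(words, "?at", "?"))
print(keys_sketch(words, "?at?", "?", MATCH_AT_MOST_PREFIX))
print(keys_sketch(words, "?at?", "?", MATCH_AT_LEAST_PREFIX))
```

The real method walks the trie instead of filtering a list, so it does not need to visit every stored word.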
values([prefix, [wildchar, [how]]]) => yield object
- Returns an iterator that iterates through values associated with words. Words are matched as in the keys method.
items([prefix, [wildchar, [how]]]) => yield tuple (string, object)
- Returns an iterator that iterates through words and associated values. Words are matched as in the keys method.
iter() protocol
- Equivalent to obj.keys()
len() protocol
- Returns the number of distinct words.
add_word(word, [value]) => bool
Adds a new word, a key, to the dictionary and associates it with value. Returns True if word didn't exist earlier in the dictionary.
If store == STORE_LENGTH then value is not allowed --- len(word) is saved.
If store == STORE_INTS then value is optional. If present, it has to be an integer; otherwise it defaults to len(automaton).
If store == STORE_ANY then value is required and can be any object.
This method invalidates all iterators only if a new word was added (i.e. the method returned True).
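The three store modes differ only in how the value argument is interpreted. The rule can be written down as a small stand-alone function. This is a sketch, not the module's code: resolve_value, the string stand-ins for the STORE_* constants and the use of TypeError are assumptions made for illustration, and current_len simply stands for whatever len(automaton) returns at the time of the call.

```python
# Hypothetical stand-ins for the module's STORE_* constants.
STORE_ANY, STORE_INTS, STORE_LENGTH = "any", "ints", "length"

# Sentinel that distinguishes "no value passed" from value=None.
_MISSING = object()


def resolve_value(store, word, current_len, value=_MISSING):
    """Return the value that would be saved for word under the given store mode."""
    if store == STORE_LENGTH:
        if value is not _MISSING:
            raise TypeError("value is not allowed with STORE_LENGTH")
        return len(word)
    if store == STORE_INTS:
        if value is _MISSING:
            return current_len  # default: len(automaton)
        if not isinstance(value, int):
            raise TypeError("value has to be an integer")
        return value
    # STORE_ANY: any object is fine, but one must be given.
    if value is _MISSING:
        raise TypeError("value is required with STORE_ANY")
    return value


print(resolve_value(STORE_LENGTH, "hers", 0))
print(resolve_value(STORE_INTS, "hers", 3))
print(resolve_value(STORE_INTS, "hers", 3, 42))
print(resolve_value(STORE_ANY, "hers", 3, ("any", "object")))
```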
clear() => None
Removes all words from dictionary.
This method invalidates all iterators.
exists(word) => bool or word in ...
- Returns True if word is present in the dictionary.
match(word) => bool
- Returns True if there is a prefix (or word) equal to word. For example, if the word "example" is present in the dictionary, then all of match("e"), match("ex"), ..., match("exampl"), match("example") are True, but exists() is True only for the last one.
longest_prefix(word) => integer
- Returns the length of the longest prefix of word that exists in the dictionary.
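The difference between exists, match and longest_prefix is easiest to see on a toy trie built from nested dictionaries. The sketch below is not how the module stores its data; build_trie, walk and the END marker are made-up names. It also assumes that longest_prefix counts how far word can be followed along the trie, which is one possible reading of the description above.

```python
END = object()  # marker key: a stored word ends at this node


def build_trie(words):
    """Build a nested-dict trie; each level maps a character to a child node."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def walk(root, word):
    """Follow word from the root; return (characters consumed, last node reached)."""
    node = root
    for depth, ch in enumerate(word):
        if ch not in node:
            return depth, node
        node = node[ch]
    return len(word), node


def exists(root, word):
    depth, node = walk(root, word)
    return depth == len(word) and END in node


def match(root, word):
    depth, _ = walk(root, word)
    return depth == len(word)


def longest_prefix(root, word):
    # Assumption: "exists" is read as "exists as a path in the trie".
    depth, _ = walk(root, word)
    return depth


trie = build_trie(["example"])
print(match(trie, "exampl"), exists(trie, "exampl"))
print(match(trie, "example"), exists(trie, "example"))
print(longest_prefix(trie, "exams"))
```

Each query does one dictionary lookup per character, which is where the "time proportional to string length" property of a trie comes from.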
make_automaton()
Creates the Aho-Corasick automaton based on the trie. This doesn't require additional memory. After successful creation, kind becomes AHOCORASICK.
This method invalidates all iterators.
find_all(string, callback, [start, [end]])
Performs the Aho-Corasick search on string; start/end can be used to restrict the searched range. The callback is called with two arguments:
- index of the end of the matched string
- value associated with that string
(Calling the method with start/end does a similar job to find_all(string[start:end], callback), except for the index values.)
iter(string, [start, [end]])
Returns an iterator (an object of class AutomatonSearchIter) that does the same thing as find_all, yielding tuples instead of calling a user function. The find_all method could be expressed as:
    def find_all(self, string, callback):
        for index, value in self.iter(string):
            callback(index, value)
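For readers who want to see what happens during such a search, here is a compact pure-Python sketch of the textbook Aho-Corasick algorithm. It is an illustration only, not the module's code; build_tables and iter_matches are made-up names. Like iter, it yields pairs of (index of the end of the match, value).

```python
from collections import deque


def build_tables(words):
    """Build goto/fail/output tables from a {word: value} mapping."""
    goto = [{}]   # goto[node][char] -> child node
    fail = [0]    # fail[node] -> node to fall back to on a mismatch
    out = [[]]    # out[node] -> values of all words ending at this node
    for word, value in words.items():
        node = 0
        for ch in word:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(value)
    # Breadth-first pass: fail links of shallower nodes are ready first.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            queue.append(child)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Words ending at the fail target also end here (as suffixes).
            out[child] = out[child] + out[fail[child]]
    return goto, fail, out


def iter_matches(tables, text):
    """Yield (index of last matched character, value) for every occurrence."""
    goto, fail, out = tables
    node = 0
    for index, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for value in out[node]:
            yield index, value


words = {w: (i, w) for i, w in enumerate("he her hers she".split())}
tables = build_tables(words)
for item in iter_matches(tables, "_hershe_"):
    print(item)
```

The text is scanned once from left to right; a mismatch never moves the position back in the text, only along the fail links of the automaton.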
dump() => (list of nodes, list of edges, list of fail links)
Returns 3 lists describing a graph:
- nodes: each item is a pair (node id, end of word marker)
- edges: each item is a triple (node id, label char, child node id)
- fail: each item is a pair (source node id, id of the node connected by the fail link)
ID is a unique number and a label is a single byte.
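The three lists map naturally onto a Graphviz graph description. The sketch below shows one possible conversion; it is not the conversion script shipped with the module. The function name to_dot, the node naming scheme and the styling choices are all assumptions, and the sample lists are written by hand in the shape described above rather than taken from a real dump() call.

```python
def to_dot(nodes, edges, fail):
    """Render dump()-shaped lists as Graphviz DOT source text."""
    lines = ["digraph automaton {"]
    for node_id, is_end in nodes:
        # Nodes where a word ends get a double circle.
        shape = "doublecircle" if is_end else "circle"
        lines.append(f"  n{node_id} [shape={shape}];")
    for src, label, dst in edges:
        # Labels are single bytes; decode them for display and escape quotes.
        text = label.decode("latin-1") if isinstance(label, bytes) else str(label)
        text = text.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'  n{src} -> n{dst} [label="{text}"];')
    for src, dst in fail:
        # Fail links are drawn dashed to tell them apart from trie edges.
        lines.append(f"  n{src} -> n{dst} [style=dashed];")
    lines.append("}")
    return "\n".join(lines)


# Hand-written sample for the single word "he"; not real dump() output.
nodes = [(0, False), (1, False), (2, True)]
edges = [(0, b"h", 1), (1, b"e", 2)]
fail = [(1, 0), (2, 0)]
print(to_dot(nodes, edges, fail))
```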
The module package also contains the program dump2dot.py, which shows how to convert dump results to an input file for graphviz tools.
get_stats() => dict
Returns a dictionary containing some statistics about the underlying trie:
- nodes_count --- total number of nodes
- words_count --- same as len(automaton)
- longest_word --- length of the longest word
- links_count --- number of edges
- sizeof_node --- size of a single node in bytes
- total_size --- total size of the trie in bytes (about nodes_count * sizeof_node + links_count * size of pointer). The real size occupied by the structure could be larger because of internal fragmentation in the memory manager.
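The approximation given for total_size is simple arithmetic. The sketch below applies it to a dictionary with the same keys as above; the numbers are made up for illustration, and estimate_total_size is a hypothetical helper, not part of the module. The pointer size defaults to that of the running interpreter.

```python
import struct


def estimate_total_size(stats, pointer_size=struct.calcsize("P")):
    """Apply nodes_count * sizeof_node + links_count * pointer size."""
    return (stats["nodes_count"] * stats["sizeof_node"]
            + stats["links_count"] * pointer_size)


# Made-up numbers; real values would come from get_stats().
stats = {"nodes_count": 8, "links_count": 7, "sizeof_node": 24}
print(estimate_total_size(stats, pointer_size=8))
```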
The AutomatonSearchIter class isn't available directly; an object of this class is returned by the iter method of Automaton. The iterator has one additional method.
set(string, [reset]) => None
Sets a new string to process. When reset is False (default), processing is continued, i.e. the internal state of the automaton and the index aren't touched. This allows processing larger strings in chunks, for example:
    it = automaton.iter(b"")
    while True:
        buffer = receive(server_address, 4096)
        if not buffer:
            break
        it.set(buffer)
        for index, value in it:
            print(index, '=>', value)
When reset is True, processing is restarted. For example, this code:
    for string in strings:
        for index, value in automaton.iter(string):
            print(index, '=>', value)
does the same job as:
    it = automaton.iter(b"")
    for string in strings:
        it.set(string, True)
        for index, value in it:
            print(index, '=>', value)
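Keeping the state and the index between chunks matters because it lets a match span a chunk boundary. The pure-Python sketch below models that behaviour with a small textbook Aho-Corasick automaton. It is not the module's implementation, and build_tables and ChunkedSearch are made-up names.

```python
from collections import deque


def build_tables(words):
    """Build goto/fail/output tables from a {word: value} mapping."""
    goto, fail, out = [{}], [0], [[]]
    for word, value in words.items():
        node = 0
        for ch in word:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(value)
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            queue.append(child)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]
    return goto, fail, out


class ChunkedSearch:
    """Search iterator whose state survives between calls to set()."""

    def __init__(self, tables):
        self.tables = tables
        self.node = 0    # current automaton state
        self.index = 0   # position counted across all chunks so far
        self.chunk = ""

    def set(self, chunk, reset=False):
        if reset:
            # Restart: forget the automaton state and the running index.
            self.node, self.index = 0, 0
        self.chunk = chunk

    def __iter__(self):
        goto, fail, out = self.tables
        for ch in self.chunk:
            while self.node and ch not in goto[self.node]:
                self.node = fail[self.node]
            self.node = goto[self.node].get(ch, 0)
            index = self.index
            self.index += 1
            for value in out[self.node]:
                yield index, value
        self.chunk = ""  # the chunk is consumed


search = ChunkedSearch(build_tables({"hers": "hers"}))
search.set("_he")          # the word starts in this chunk ...
first = list(search)
search.set("rs_")          # ... and ends in the next one
second = list(search)
print(first, second)
```

With reset=True on the second call the partial match from the first chunk would be discarded and nothing would be reported.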
    >>> import ahocorasick
    >>> A = ahocorasick.Automaton()

    # add some words to trie
    >>> for index, word in enumerate("he her hers she".split()):
    ...     A.add_word(word, (index, word))

    # test if word exists in set
    >>> "he" in A
    True
    >>> "HER" in A
    False
    >>> A.get("he")
    (0, 'he')
    >>> A.get("she")
    (3, 'she')
    >>> A.get("cat", "<not exists>")
    '<not exists>'
    >>> A.get("dog")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    KeyError

    # convert trie to Aho-Corasick automaton
    >>> A.make_automaton()

    # then find all occurrences in string
    >>> for item in A.iter("_hershe_"):
    ...     print(item)
    ...
    (2, (0, 'he'))
    (3, (1, 'her'))
    (4, (2, 'hers'))
    (6, (3, 'she'))
    (6, (0, 'he'))
Demonstration of keys behaviour.
    >>> import ahocorasick
    >>> A = ahocorasick.Automaton()

    # add some words to trie
    >>> for index, word in enumerate("cat catastropha rat rate bat".split()):
    ...     A.add_word(word, (index, word))

    # prefix
    >>> list(A.keys("cat"))
    ["cat", "catastropha"]

    # pattern
    >>> list(A.keys("?at", "?", ahocorasick.MATCH_EXACT_LENGTH))
    ["bat", "cat", "rat"]
    >>> list(A.keys("?at?", "?", ahocorasick.MATCH_AT_MOST_PREFIX))
    ["bat", "cat", "rat", "rate"]
    >>> list(A.keys("?at?", "?", ahocorasick.MATCH_AT_LEAST_PREFIX))
    ["rate"]