scoder / acora

Fast multi-keyword search engine for text strings

Home Page: http://pypi.python.org/pypi/acora

fused types

orbisvicis opened this issue

Would it be possible to fuse the _unicode and _byte functions? Apart from fetching the next element (and technically the hard-coded sizeofs), both Cython code paths are identical, I think. The pure-Python code is 100% identical. I'm not familiar with Cython and the code seems more complicated than the standard ahocorasick implementation, so maybe not. I'm thinking of some lookup table matching type -> get_next_element_function. Perhaps then the container type can be extended to any (Python) sequence type and the contained type can be any (Python) comparable type. Cython automatically selects the fastest type, I think, so str->UCS4, bytes->char, int->int?, bool->bint, ->.
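To make the "one code path for any sequence type" idea concrete, here is a minimal pure-Python sketch of an Aho-Corasick search that only relies on iterating the input and hashing its items. It is not acora's implementation: build_automaton and find_all are hypothetical names, and plain Python iteration stands in for the per-type "get next element" step that a Cython version would have to specialise.

```python
from collections import deque


def build_automaton(keywords):
    """Build trie, failure links and outputs for sequences of hashable items.

    Hypothetical helper for illustration; not part of acora's API.
    """
    goto = [{}]   # goto[node][item] -> child node index
    fail = [0]    # failure link per node (root is node 0)
    out = [[]]    # keywords that end at each node

    # Insert every keyword into the trie, item by item.
    for kw in keywords:
        node = 0
        for item in kw:
            child = goto[node].get(item)
            if child is None:
                goto.append({})
                fail.append(0)
                out.append([])
                child = len(goto) - 1
                goto[node][item] = child
            node = child
        out[node].append(kw)

    # Breadth-first pass to compute failure links and merge outputs.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for item, child in goto[node].items():
            queue.append(child)
            f = fail[node]
            while f and item not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(item, 0)
            out[child] = out[child] + out[fail[child]]
    return goto, fail, out


def find_all(automaton, data):
    """Return (keyword, start_index) pairs for every match in data."""
    goto, fail, out = automaton
    node = 0
    matches = []
    for pos, item in enumerate(data):
        # Follow failure links until a transition for this item exists.
        while node and item not in goto[node]:
            node = fail[node]
        node = goto[node].get(item, 0)
        for kw in out[node]:
            matches.append((kw, pos - len(kw) + 1))
    return matches


# The same two functions handle str, bytes and a list of booleans.
text_hits = find_all(build_automaton(["he", "she", "his", "hers"]), "ushers")
byte_hits = find_all(build_automaton([b"he", b"she", b"his", b"hers"]), b"ushers")
bool_hits = find_all(build_automaton([(True, False)]), [True, True, False])
```

Keywords and searched data need to yield comparable items (str with str, bytes with bytes), since iterating bytes produces integers while iterating str produces one-character strings.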

I'd also like to support the c bitarray extension module, which is a char*-backed List[Bool], without the intermediate boxing of bit to python boolean. Any ideas?

Good question. I remember thinking about merging the two at some point, but given how critical the performance is here, ended up optimising them separately. I even recall making them more similar again at some point, but that was already a while ago...

I would expect bitarray to export a buffer, which could then be unpacked and used in acora. Since the bytes type (and Python's bytearray, array.array, memoryview and others) supports the buffer interface as well, it should be enough to switch the current bytes implementation to char[:] buf (or unsigned char[:]?) as input and pass &buf[0] and length.
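As a rough illustration of the buffer idea, the following Python sketch accepts any object that exports a contiguous buffer and scans it as flat unsigned bytes without copying. It is only an analogy: find_keyword is a hypothetical name, a naive scan replaces acora's automaton, and Python's memoryview plays the role that an unsigned char[:] argument would play in Cython. Whether bitarray actually exports a buffer is not checked here.

```python
import array


def find_keyword(buf, keyword):
    """Return start offsets of keyword in any contiguous buffer exporter.

    Hypothetical helper for illustration; a naive scan, not acora's search.
    """
    # View the input as a flat sequence of unsigned bytes, without copying.
    view = memoryview(buf).cast("B")
    key = bytes(keyword)
    n, k = len(view), len(key)
    # Slicing a memoryview does not copy; comparison is by content.
    return [i for i in range(n - k + 1) if view[i:i + k] == key]


# One function, several buffer-exporting input types.
hits_bytes = find_keyword(b"ushers", b"he")
hits_bytearray = find_keyword(bytearray(b"ushers"), b"he")
hits_array = find_keyword(array.array("B", b"ushers"), b"he")
hits_view = find_keyword(memoryview(b"ushers"), b"he")
```

Note that str does not export a buffer, so a unicode code path would still need separate handling under this approach.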

Want to give it a try?