glitzflitz / pyxorfilter

Python bindings for xorfilter(faster and smaller than bloom and cuckoo filters)

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

pyxorfilter

Python bindings for C implementation of Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters and of Binary Fuse Filters: Fast and Smaller Than Xor Filters.

Installation

pip install pyxorfilter

From Source

git clone --recurse-submodules https://github.com/glitzflitz/pyxorfilter
cd pyxorfilter
python setup.py build_ext
python setup.py install

Usage

>>> from pyxorfilter import Xor8, Xor16, Fuse8, Fuse16
>>> filter = Xor8(5)	#or Xor16(size)
>>> #Supports unicode strings and heterogeneous types
>>> test_str = ["あ","अ", 51, 0.0, 12.3]
>>> filter.populate(test_str)
True
>>> filter.contains("अ")
True
>>> filter[51]  #You can use __getitem__ instead of contains
True
>>> filter["か"]
False
>>> filter.contains(150)
False
>>> filter.size_in_bytes()
60

You can serialize a filter with the serialize() method which returns a buffer, and you can recover the filter with the deserialize(buffer) method, which returns a filter:

> f = open('/tmp/output', 'wb')
> f.write(filter.serialize())
> f.close()
> recoverfilter = Xor8.deserialize(open('/tmp/output', 'rb').read())

Caveats

Accuracy

For more accuracy(less false positives) use larger but more accurate Xor16 for Fuse16.

For large sets (contain millions of keys), Fuse8/Fuse16 filters are faster and smaller than Xor8/Xor16.

>>> filter = Xor8(1000000)
>>> filter.size_in_bytes()
1230054
>>> filter = Fuse8(1000000)
>>> filter.size_in_bytes()
1130536

TODO

  • Add unit tests
  • Add CI support for distributing pyxorfilter with PyPI.

Links

About

Python bindings for xorfilter(faster and smaller than bloom and cuckoo filters)

License:Apache License 2.0


Languages

Language:Python 100.0%