A fast Bloomfilter implementation in C with a Python wrapper.
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set".
Works with memory mapped files or pure memory filters thanks to a slightly modifieid mmapf
from the great Brainflayer project.
Super duper fast hashing is provided by xxHash
This has been tested on Ubuntu 18.04, but it should work most places. Windows will need some work for the Python module.
git clone https://github.com/duckythescientist/duckbloom
cd duckbloom
git submodule init && git submodule update
make
This is optional. Installs libduckbloom.so
to /usr/local/lib
and the header to /usr/local/include
.
make install
// C/C++
bloom_ctx * ctx = bloom_malloc();
bloom_open(ctx, "myfile", 4096, 10);
char * mystr = "hello";
bloom_add(ctx, mystr, strlen(mystr));
bool ret = bloom_check(ctx, mystr, strlen(mystr));
# Python
bloom = Bloom("myfile.blm", n=1000, p=1E-6)
bloom.add("hello, world")
bloom.add(b"works with bytes too")
assert bloom.check("hello, world")
The header file is reasonably well documented.
Calculator: https://hur.st/bloomfilter
n = ceil(m / (-k / log(1 - exp(log(p) / k))))
p = pow(1 - exp(-k / (m / n)), k)
m = ceil((n * log(p)) / log(1 / pow(2, log(2))));
k = round((m / n) * log(2));