Tunable error rate

Question

Tunable error rate

nathanhack opened this issue 4 years ago · comments

It would be great to have a way to change/tune the false negative rate.

Thomas Mueller · Answer 1 · Fri Sep 11 2020 03:50:15 GMT+0800 (China Standard Time)

I assume you mean false positive rate, right? I know some libraries provide functions to e.g. calculate the bits per entry for a false positive rate, or the false positive rate for a given bits per entry, or the size of the filter (memory used) from rate and entry count, and so on. Yes I agree it would be a good addition!

nathanhack · Answer 2 · Fri Sep 11 2020 03:56:59 GMT+0800 (China Standard Time)

Sorry, yes, I meant false positive rate.

nathanhack · Answer 3 · Fri Sep 11 2020 07:46:11 GMT+0800 (China Standard Time)

In particular, I'm interested in the case of setting a fixed size and letting the positive rate fluctuate as the keys are inserted.
@thomasmueller do you have any suggestion on how the repo might be changed to include such a feature? As time permits, I'll work on it and if/when I get it done I can put in a PR if that is something the repo owners would like.

Thomas Mueller · Answer 4 · Fri Sep 11 2020 15:40:28 GMT+0800 (China Standard Time)

setting a fixed size and letting the positive rate fluctuate as the keys are inserted.

A xor filter, unlike a Bloom filter, is built once all keys are present. A Bloom filter allows to add entries afterwards, but a xor filter does not. So what you are looking for might not be possible I'm afraid...

nathanhack · Answer 5 · Fri Sep 11 2020 21:12:42 GMT+0800 (China Standard Time)

Yeah to clarify, I indeed meant all at once, such that you set the size and let the error rate suffer.
After briefly looking at the paper, I would hazard a guess the main issue would be how to choose the hashing functions? Because, if I understand it correctly we basically keep trying new hashing until we get everything in, but with a fixed size this would always fail. If true then the simplest approach would be to keep track of the best set of hashes and at the end of MaxIterations, just use the best hashes. That way if the user wishes to continue trying their luck they can just increase MaxIterations. It seems like there should be some way to calculate the expected error rate based on the fixed size but I don't recall seeing that explicitly written down in the paper (maybe it was or maybe it could be derived but I honestly don't know it would require a through reading).

Thomas Mueller · Answer 6 · Fri Sep 11 2020 23:28:19 GMT+0800 (China Standard Time)

I indeed meant all at once

Ah ok! Yes that works then.

you set the size and let the error rate suffer.

The implementation here (the current implementation) only supports 8 bit fingerprints (uint8). It doesn't support smaller or larger fingerprints right now. So I'm afraid this doesn't currently work... It's possible to add more implementations, but we believe it's not currently worth it... Of course feel free to provide a patch!

increase MaxIterations

No I'm afraid that's not a solution... What you could change a tiny bit is this line:

capacity := 32 + uint32(math.Ceil(1.23*float64(size)))

and, for example, try to use 1.229 instead of 1.23. It would work sometimes. But probability to find a seed reduces drastically, and very quickly... It's not something I would try. With a regular Bloom filter, yes you can just make it smaller and it will still "work" (just with a bad false positive rate), but not a xor filter.

Brian Olson · Answer 7 · Fri Apr 09 2021 05:51:12 GMT+0800 (China Standard Time)

I implemented xor16 and I think I got it generalized also to xorN (for [9..32] bits).
https://github.com/brianolson/xorfilter/tree/xor16
There's some other stuff in there about maintaining a fork in case this work didn't make it back upstream, but that shouldn't be hard to un-diff.

Daniel Lemire · Answer 8 · Fri Jun 18 2021 23:24:23 GMT+0800 (China Standard Time)

@brianolson Your generic implementation seems to set the memory usage to 32-bit per entry. I understand that you can write only the least significant N bits of each word to disk or to the network, and then upsample when deserializing... but it comes at a cost.