gyson / blex

Fast Bloom filter with concurrent accessibility, powered by :atomics module.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

false_positive_probability not holding for large bloomfilter

rhettlunn opened this issue · comments

I was expecting the False Positive rate to stay below the level the bloomfilter was initialized with until the bloomfilter reaches capacity, but instead it vastly exceeds it.

Create a bloomfilter with 400 million capacity, FPP of 0.001, put 100 million things into the bloomfilter, then test 1 million things that weren't put into the bloomfilter.

I expected to find 1,000 or less false positives, but instead I find 23,242.

iex(1)> b = Blex.new(400000000, 0.001)
%Blex{
  a: #Reference<0.3778163223.2071855107.97792>,
  b: 30,
  hash_fn: #Function<4.114810444/2 in Blex.get_hash_fn/1>,
  hash_id: 202,
  k: 10,
  m: 1073741824
}
iex(2)> 1..100000000 |> Enum.map(fn x -> Blex.put(b, x) end)
[:ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok,
 :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok,
 :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok,
 :ok, :ok, ...]
iex(3)> 101000001..102000000 |> Enum.map(fn x -> Blex.member?(b, x) end) |> Enum.filter(fn x -> x end) |> Enum.count
23242

I get the exact same false positive count when I initialize the bloomfilter with a capacity of 1.6 billion.

While experimenting I also tried creating a bloomfilter with capacity of 3.2 billion, but that resulted in a very different error that I'll open a separate issue for.

I have the same issue. I use in production bloomfilter created with b = Blex.new(300000000, 0.0001) and containing 250 million things. Just tested 1 million things and received 56784 false positives.