johnmyleswhite / BloomFilters.jl

Bloom filters in Julia

Bloom filter: what is wrong? Why does contains return false?

paulanalyst opened this issue · comments

Julia Version 0.3.0-prerelease+2599 (2014-04-11 23:52 UTC)
Commit bf7096c (51 days old master)
x86_64-w64-mingw32

julia> using BloomFilters

julia> n, k = 100, 50
(100,50)

julia> ety=readcsv("etykiety_kli.txt")
56522x1 Array{Any,2}:
"EC00113876"
"EC00085985"
"EC00037297"
"EC00005413"
"EC00126328"
"EC00021867"
"EC00114062"
"EC00007751"
"EC00206892"
"EC00115609"
⋮
"EC00159409"
"EC00172340"
"EC00062096"
"EC00134183"
"EC00108009"
"EC00050665"
"EC00081817"
"EC00155357"
"EC00031904"
"EC00060934"

julia> filter = BloomFilter(n, k)
A Bloom filter

  • Mask Size: 100
  • Number of Hashes: 50

julia> add!(filter, ety)

julia> x=ety[2]
"EC00085985"

julia> contains(filter,x)
false

Why is it false?

julia> findfirst(ety,x)
2

Paul

What's the SHA1 of the version of this package you're using?

BloomFilters,0.0.0

Ok. I think we may need to change the released version.

I downloaded it today.

Yes, the released version is not up-to-date.

Try updating now. You need Julia 0.3-.

OK, updated: BloomFilters,0.0.1

But:
julia> using BloomFilters

julia> n, k = 100, 50
(100,50)

julia> filter = BloomFilter(n, k)
A Bloom filter

  • Mask Size: 100
  • Number of Hashes: 50

julia> ety=readcsv("etykiety_kli.txt")
56522x1 Array{Any,2}:
"EC00113876"
"EC00085985"
"EC00037297"
"EC00005413"
"EC00126328"
"EC00021867"
"EC00114062"
"EC00007751"
"EC00206892"
"EC00115609"
⋮
"EC00159409"
"EC00172340"
"EC00062096"
"EC00134183"
"EC00108009"
"EC00050665"
"EC00081817"
"EC00155357"
"EC00031904"
"EC00060934"

julia> add!(filter, ety)

julia> x=ety[2]
"EC00085985"

julia> contains(filter,x)
false

julia> findfirst(ety,x)
2

ety is Array{Any,2}: ...

julia> add!(filter, ety)

Why is this true:
julia> contains(filter,ety)
true
???

julia> filter
A Bloom filter

  • Mask Size: 100
  • Number of Hashes: 50

julia> ety
56522x1 Array{Any,2}:
"EC00113876"
"EC00085985"
"EC00037297"
"EC00005413"
"EC00126328"
"EC00021867"
"EC00114062"
"EC00007751"
"EC00206892"
"EC00115609"
⋮
"EC00159409"
"EC00172340"
"EC00062096"
"EC00134183"
"EC00108009"
"EC00050665"
"EC00081817"
"EC00155357"
"EC00031904"
"EC00060934"

julia> contains(filter,vec(ety))
56522-element BitArray{1}:
true
true
true
true
true
true
true
true
true
true
⋮
true
true
true
true
true
true
true
true
true
true

julia>

After vec:
julia> contains(filter,"somethink")
true
:/

Ok. I'll look into this. It might take me a couple of weeks.

It works better if the data is not too long, e.g. add!(filter,ety[1:50]), and when n is high (around 10000):
julia> using BloomFilters

julia> n, k = 100, 50
(100,50)

julia> filter = BloomFilter(n, k)
A Bloom filter

  • Mask Size: 100
  • Number of Hashes: 50

julia> ety=readcsv("etykiety_kli.txt")
56522x1 Array{Any,2}:
"EC00113876"
"EC00085985"
"EC00037297"
"EC00005413"
"EC00126328"
"EC00021867"
"EC00114062"
"EC00007751"
"EC00206892"
"EC00115609"
⋮
"EC00159409"
"EC00172340"
"EC00062096"
"EC00134183"
"EC00108009"
"EC00050665"
"EC00081817"
"EC00155357"
"EC00031904"
"EC00060934"

julia> add!(filter,ety[1:50])

julia> x=ety[2]
"EC00085985"

julia> contains(filter,x)
true

julia> findfirst(ety,x)
2

julia> contains(filter,ety[1])
true

julia> contains(filter,ety[50])
true

julia> contains(filter,ety[51])
true

julia> contains(filter,ety[55])
false

julia> contains(filter,ety[505])
true

julia> using BloomFilters

julia> n, k = 10000, 50
(10000,50)

julia> filter = BloomFilter(n, k)
A Bloom filter

  • Mask Size: 10000
  • Number of Hashes: 50

julia> ety=readcsv("etykiety_kli.txt")
56522x1 Array{Any,2}:
"EC00113876"
"EC00085985"
"EC00037297"
"EC00005413"
"EC00126328"
"EC00021867"
"EC00114062"
"EC00007751"
"EC00206892"
"EC00115609"
⋮
"EC00159409"
"EC00172340"
"EC00062096"
"EC00134183"
"EC00108009"
"EC00050665"
"EC00081817"
"EC00155357"
"EC00031904"
"EC00060934"

julia> add!(filter,ety[1:50])

julia> x=ety[2]
"EC00085985"

julia> contains(filter,x)
true

julia> findfirst(ety,x)
2

julia> contains(filter,ety[505])
false

julia> contains(filter,ety[50])
true

julia> contains(filter,ety[51])
false

julia> contains(filter,ety[52])
false

julia> contains(filter,ety[53])
false

julia> contains(filter,ety[55])
false

julia> contains(filter,ety[49])
true

julia> contains(filter,ety[48])
true

julia> contains(filter,ety[47])
true

julia>
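
The pattern in the two sessions above (labels that were never added report true with a mask size of 100, but report false with a mask size of 10000) is roughly what the usual Bloom filter false-positive estimate predicts. This is a sketch under assumptions: it treats the k hashes as independent and uniform, which may not hold for BloomFilters.jl. Here n is the mask size and k the number of hashes, as in the session; N is notation introduced here for the number of items added.

$$
p \approx \left(1 - e^{-kN/n}\right)^{k}
$$

With n = 100, k = 50, N = 50: kN/n = 25, so 1 - e^{-25} is almost exactly 1 and p ≈ 1. Under the assumption, nearly every bit is set and almost any key tests as present. The session still shows one false (ety[55]), so the real hashes evidently do not fill the mask completely.

With n = 10000, k = 50, N = 50: kN/n = 0.25, so 1 - e^{-0.25} ≈ 0.22 and p ≈ 0.22^{50}, roughly 10^{-33}. False positives are then practically impossible, which matches ety[51] through ety[55] and ety[505] all returning false.

Adding all 56522 labels to a mask of 100 bits makes kN/n enormous, so p ≈ 1 there as well, consistent with contains(filter,"somethink") returning true.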

It must be ety[:].
OLD:
julia> filter = BloomFilter(n, k)
A Bloom filter

  • Mask Size: 1000
  • Number of Hashes: 50

julia> add!(filter,ety)

julia> x=ety[2]
"EC00085985"

julia> contains(filter,x)
false

NEW:
julia> filter = BloomFilter(n, k)
A Bloom filter

  • Mask Size: 1000
  • Number of Hashes: 50

julia> add!(filter,ety[:])

julia> x=ety[2]
"EC00085985"

julia> contains(filter,x)
true
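
One possible reading of the OLD/NEW difference, offered as an assumption and not as a confirmed description of BloomFilters.jl: ety is a 56522x1 matrix (Array{Any,2}), not a vector, so add! may have treated the whole matrix as a single key, while ety[:] and vec(ety) are vectors and get added element by element. That would also explain why contains(filter,ety) returned a single true. The sketch below uses a toy filter written in current Julia syntax (not 0.3); ToyBloom, positions, add_key! and has_key are hypothetical names, and the hashing scheme is made up for illustration.

```julia
# Toy Bloom filter for illustration only; NOT the BloomFilters.jl implementation.
struct ToyBloom
    bits::BitVector
    k::Int
end

ToyBloom(n::Integer, k::Integer) = ToyBloom(falses(n), k)

# Hypothetical hashing scheme: k seeded hashes mapped onto 1:n.
positions(f::ToyBloom, key) =
    [Int(hash(key, UInt(i)) % UInt(length(f.bits))) + 1 for i in 1:f.k]

# Generic method: the argument is ONE key, whatever its type (even a matrix).
function add_key!(f::ToyBloom, key)
    for p in positions(f, key)
        f.bits[p] = true
    end
    return f
end

# Vector method: every element is added as a separate key.
function add_key!(f::ToyBloom, items::AbstractVector)
    for item in items
        add_key!(f, item)
    end
    return f
end

has_key(f::ToyBloom, key) = all(f.bits[p] for p in positions(f, key))

# A 3x1 matrix, shaped like the result of readcsv in the thread.
labels = reshape(["EC00113876", "EC00085985", "EC00037297"], 3, 1)

f1 = add_key!(ToyBloom(10_000, 5), labels)       # matrix: stored as one key
f2 = add_key!(ToyBloom(10_000, 5), vec(labels))  # vector: stored per element

println(count(f1.bits) <= 5)     # true: only one key's worth of bits is set
println(has_key(f1, labels))     # true: the matrix itself is the stored key
println(has_key(f2, labels[2]))  # true: an added element is always found
```

In the toy filter, has_key(f1, labels[2]) is very likely false, because the individual labels were never hashed. Whether the real package dispatched this way in 0.0.1 is not confirmed by the thread.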

Try again with 0.1.0.