koron / hyperminhash

HyperMinHash: Bringing intersections to HyperLogLog

HyperMinSketch

Besides being a compact and pretty speedy HyperLogLog implementation for cardinality counting, this modified HyperLogLog allows intersection and similarity estimation of different HyperLogLogs.

Details

A simple implementation of HyperLogLog (LogLog-Beta to be specific):

  • 16-bit registers instead of 6-bit; the 10 extra bits hold b-bit signatures
  • The similarity function estimates Jaccard indices (a number between 0 and 1) of 0.01 for set cardinalities on the order of 1e9, with accuracy around 5%
  • Intersection multiplies the Jaccard index by the cardinality of the union of the sets to return the cardinality of the intersection
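
To make the register layout in the list above concrete, here is a minimal sketch of packing a 6-bit leading-zero count and a 10-bit signature into one 16-bit register. The names `register`, `pack`, `lz` and `sig` are hypothetical, and the bit order (count in the high bits, signature in the low bits) is an assumption, not something this README states.

```go
package main

import "fmt"

// Assumed layout (not confirmed by this README): the 6 high bits hold the
// HyperLogLog leading-zero count, the 10 low bits hold the b-bit signature.
const (
	lzBits  = 6
	sigBits = 10
	sigMask = 1<<sigBits - 1
)

// register is a hypothetical 16-bit register type used for illustration.
type register uint16

// pack stores lz in the high 6 bits and sig in the low 10 bits.
// Values that do not fit are truncated to their low bits.
func pack(lz uint8, sig uint16) register {
	return register(uint16(lz&(1<<lzBits-1))<<sigBits | sig&sigMask)
}

// lz returns the leading-zero count stored in the high 6 bits.
func (r register) lz() uint8 { return uint8(r >> sigBits) }

// sig returns the signature stored in the low 10 bits.
func (r register) sig() uint16 { return uint16(r) & sigMask }

func main() {
	r := pack(17, 0x2A5)
	fmt.Println(r.lz(), r.sig())
}
```

With this assumed layout, comparing two registers as plain integers orders them by leading-zero count first and by signature second.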

The work is based on "HyperMinHash: Jaccard index sketching in LogLog space" by Yun William Yu and Griffin M. Weber.

Example Usage

// Assumes "strconv" and this package (as hyperminhash) are imported.
sk1 := hyperminhash.New()
sk2 := hyperminhash.New()

for i := 0; i < 10000; i++ {
    sk1.Add([]byte(strconv.Itoa(i)))
}

sk1.Cardinality() // 10001 (should be 10000)

for i := 3333; i < 23333; i++ {
    sk2.Add([]byte(strconv.Itoa(i)))
}

sk2.Cardinality()     // 19977 (should be 20000)
sk1.Similarity(sk2)   // 0.284589082 (should be 0.2857326533)
sk1.Intersection(sk2) // 6623 (should be 6667)

sk1.Merge(sk2)
sk1.Cardinality() // 23271 (should be 23333)
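
The numbers in the example above are consistent with the intersection being the Jaccard estimate multiplied by the cardinality of the union. The following is a minimal standalone sketch of that arithmetic. The helper `intersectionFromJaccard` is hypothetical and does not call the library, and rounding to the nearest integer is an assumption.

```go
package main

import (
	"fmt"
	"math"
)

// intersectionFromJaccard applies |A ∩ B| ≈ J(A, B) · |A ∪ B|.
// Rounding to the nearest integer is an assumption made for this sketch.
func intersectionFromJaccard(jaccard float64, unionCard uint64) uint64 {
	return uint64(math.Round(jaccard * float64(unionCard)))
}

func main() {
	// Estimated values taken from the example: similarity and merged cardinality.
	fmt.Println(intersectionFromJaccard(0.284589082, 23271)) // 6623

	// Exact values: {0..9999} and {3333..23332} share 6667 of 23333 elements.
	fmt.Println(intersectionFromJaccard(6667.0/23333.0, 23333)) // 6667
}
```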

Results

Max Cardinality 1000

| Set1 | HMH1 | HLL1 | Set2 | HMH2 | HLL2 | S1 ∪ S2 | HMH1 ∪ HMH2 | HLL1 ∪ HLL2 | S1 ∩ S2 | HMH1 ∩ HMH2 | HLL1+HLL2-(HLL1∪HLL2) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 27 | 27 | 27 | 361 | 363 | 361 | 374 | 376 | 374 | 14 (3.743316%) | 14 (3.723404%) | 14 (3.743316%) |
| 629 | 634 | 629 | 273 | 275 | 273 | 693 | 697 | 693 | 209 (30.158730%) | 211 (30.272597%) | 209 (30.158730%) |
| 705 | 709 | 705 | 642 | 645 | 642 | 708 | 712 | 708 | 639 (90.254237%) | 643 (90.308989%) | 639 (90.254237%) |
| 212 | 212 | 212 | 766 | 766 | 766 | 927 | 929 | 927 | 51 (5.501618%) | 51 (5.489774%) | 51 (5.501618%) |
| 966 | 969 | 966 | 799 | 797 | 798 | 1421 | 1426 | 1420 | 344 (24.208304%) | 346 (24.263675%) | 344 (24.225352%) |

Max Cardinality 10000

| Set1 | HMH1 | HLL1 | Set2 | HMH2 | HLL2 | S1 ∪ S2 | HMH1 ∪ HMH2 | HLL1 ∪ HLL2 | S1 ∩ S2 | HMH1 ∩ HMH2 | HLL1+HLL2-(HLL1∪HLL2) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1405 | 1410 | 1404 | 9929 | 9867 | 9929 | 11194 | 11152 | 11279 | 140 (1.250670%) | 154 (1.380918%) | 54 (0.478766%) |
| 9020 | 9024 | 9062 | 2827 | 2827 | 2827 | 10565 | 10559 | 10630 | 1282 (12.134406%) | 1327 (12.567478%) | 1259 (11.843838%) |
| 2297 | 2310 | 2296 | 3896 | 3868 | 3899 | 5557 | 5526 | 5574 | 636 (11.445024%) | 628 (11.364459%) | 621 (11.141012%) |
| 6015 | 5967 | 6055 | 621 | 616 | 621 | 6287 | 6242 | 6305 | 349 (5.551137%) | 335 (5.366870%) | 371 (5.884219%) |
| 4136 | 4123 | 4123 | 6006 | 5990 | 5975 | 10076 | 10080 | 10141 | 66 (0.655022%) | 65 (0.644841%) | 0 (0.000000%) |

Max Cardinality 100000

| Set1 | HMH1 | HLL1 | Set2 | HMH2 | HLL2 | S1 ∪ S2 | HMH1 ∪ HMH2 | HLL1 ∪ HLL2 | S1 ∩ S2 | HMH1 ∩ HMH2 | HLL1+HLL2-(HLL1∪HLL2) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 60687 | 59745 | 60356 | 98707 | 98199 | 99138 | 109123 | 108599 | 109211 | 50271 (46.068198%) | 49515 (45.594342%) | 50283 (46.042065%) |
| 3958 | 3944 | 3946 | 9674 | 9630 | 9688 | 13505 | 13460 | 13619 | 127 (0.940392%) | 132 (0.980684%) | 15 (0.110140%) |
| 67549 | 66446 | 67113 | 98052 | 97647 | 98513 | 133576 | 132744 | 133730 | 32025 (23.975115%) | 31448 (23.690713%) | 31896 (23.851043%) |
| 76325 | 75382 | 75954 | 20484 | 20366 | 20462 | 83842 | 83288 | 83875 | 12967 (15.465996%) | 13161 (15.801796%) | 12541 (14.952012%) |
| 71530 | 70369 | 71257 | 28544 | 28416 | 28585 | 88737 | 88209 | 88830 | 11337 (12.775956%) | 11198 (12.694850%) | 11012 (12.396713%) |

Max Cardinality 1000000

| Set1 | HMH1 | HLL1 | Set2 | HMH2 | HLL2 | S1 ∪ S2 | HMH1 ∪ HMH2 | HLL1 ∪ HLL2 | S1 ∩ S2 | HMH1 ∩ HMH2 | HLL1+HLL2-(HLL1∪HLL2) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 142246 | 141517 | 142793 | 332708 | 329629 | 331308 | 356575 | 353214 | 353774 | 118379 (33.198906%) | 117967 (33.398167%) | 120327 (34.012392%) |
| 114564 | 113816 | 114643 | 389979 | 386335 | 387814 | 463990 | 458454 | 459780 | 40553 (8.740059%) | 41412 (9.032967%) | 42677 (9.282048%) |
| 505829 | 503076 | 501456 | 529941 | 532761 | 533367 | 891914 | 897115 | 889646 | 143856 (16.128909%) | 148236 (16.523634%) | 145177 (16.318513%) |
| 35997 | 35747 | 36232 | 600696 | 598130 | 596847 | 626071 | 625512 | 659381 | 10622 (1.696613%) | 10302 (1.646971%) | 0 (0.000000%) |
| 476717 | 472011 | 470168 | 830577 | 829483 | 829520 | 1125584 | 1128288 | 1119286 | 181710 (16.143620%) | 187679 (16.633962%) | 180402 (16.117596%) |

Max Cardinality 10000000

| Set1 | HMH1 | HLL1 | Set2 | HMH2 | HLL2 | S1 ∪ S2 | HMH1 ∪ HMH2 | HLL1 ∪ HLL2 | S1 ∩ S2 | HMH1 ∩ HMH2 | HLL1+HLL2-(HLL1∪HLL2) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 847686 | 848830 | 843129 | 1580122 | 1567181 | 1565597 | 2005564 | 1999074 | 1996073 | 422244 (21.053629%) | 416263 (20.822791%) | 412653 (20.673242%) |
| 8543492 | 8537572 | 8468954 | 2118751 | 2132908 | 2105979 | 8706935 | 8713595 | 8637847 | 1955308 (22.456904%) | 1954912 (22.435195%) | 1937086 (22.425565%) |
| 6416419 | 6447411 | 6383130 | 6136246 | 6157177 | 6087945 | 6630774 | 6642118 | 6586601 | 5921891 (89.309197%) | 5930683 (89.289034%) | 5884474 (89.340071%) |
| 3115170 | 3098484 | 3125659 | 6531209 | 6538291 | 6486154 | 9087145 | 9084715 | 9062878 | 559234 (6.154122%) | 559115 (6.154458%) | 548935 (6.056961%) |
| 2796497 | 2773075 | 2811513 | 4481637 | 4456172 | 4506376 | 4567520 | 4541818 | 4596233 | 2710614 (59.345422%) | 2671681 (58.824044%) | 2721656 (59.214927%) |

Max Cardinality 100000000

| Set1 | HMH1 | HLL1 | Set2 | HMH2 | HLL2 | S1 ∪ S2 | HMH1 ∪ HMH2 | HLL1 ∪ HLL2 | S1 ∩ S2 | HMH1 ∩ HMH2 | HLL1+HLL2-(HLL1∪HLL2) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 92677880 | 90564361 | 91316783 | 68555852 | 68194738 | 68095644 | 116860375 | 115203878 | 115336268 | 44373357 (37.971260%) | 42887226 (37.227242%) | 44076159 (38.215350%) |
| 40923468 | 41078856 | 40516294 | 40079654 | 39978480 | 39838181 | 47005537 | 47260934 | 46213830 | 33997585 (72.326767%) | 34182014 (72.326150%) | 34140645 (73.875385%) |
| 42896835 | 43112441 | 42366207 | 54500764 | 53829285 | 54259699 | 76119877 | 75033128 | 74855461 | 21277722 (27.952912%) | 21248656 (28.319033%) | 21770445 (29.083309%) |
| 87606825 | 85822057 | 86100193 | 73453201 | 74492482 | 73813694 | 150865349 | 150805169 | 149495773 | 10194677 (6.757468%) | 9987352 (6.622685%) | 10418114 (6.968835%) |
| 56331609 | 56528002 | 55351266 | 60043260 | 58790428 | 59532480 | 97016026 | 94933407 | 95451534 | 19358843 (19.954273%) | 18455020 (19.439964%) | 19432212 (20.358198%) |
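
The last column of each table is the inclusion-exclusion baseline computed from plain HyperLogLog estimates. The sketch below reproduces two of its entries from the HLL1, HLL2 and HLL1 ∪ HLL2 columns. The helper name `inclusionExclusion` is hypothetical, and clamping negative results to zero is an assumption inferred from the 0 entries in the tables.

```go
package main

import "fmt"

// inclusionExclusion estimates |A ∩ B| as |A| + |B| - |A ∪ B|.
// When estimation error makes the union estimate exceed the sum, the result
// is clamped to 0 (an assumption inferred from the tables, not from the code).
func inclusionExclusion(cardA, cardB, cardUnion uint64) uint64 {
	if cardA+cardB <= cardUnion {
		return 0
	}
	return cardA + cardB - cardUnion
}

func main() {
	// Max Cardinality 10000, first row: the true intersection is 140.
	fmt.Println(inclusionExclusion(1404, 9929, 11279)) // 54

	// Max Cardinality 1000000, fourth row: the true intersection is 10622.
	fmt.Println(inclusionExclusion(36232, 596847, 659381)) // 0
}
```

In both rows the intersection is small relative to the union, so the error in the union estimate swamps the difference, while the HMH1 ∩ HMH2 column stays close to the true value (154 and 10302 respectively).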

About

License:MIT License


Languages

Language:Go 100.0%