Besides being a compact and fast HyperLogLog implementation for cardinality counting, this modified HyperLogLog supports intersection and similarity estimation between different HyperLogLog sketches.

It is a simple implementation of HyperLogLog (LogLog-Beta, to be specific) with the following properties:

- 16-bit registers instead of 6-bit; the 10 additional bits hold b-bit signatures
- The similarity function estimates Jaccard indices (a number between 0 and 1) of 0.01 for set cardinalities on the order of 1e9 with accuracy around 5%
- Intersection multiplies the estimated Jaccard index by the cardinality of the union of the sets to return the cardinality of the intersection

The work is based on "HyperMinHash: Jaccard index sketching in LogLog space" by Yun William Yu and Griffin M. Weber.
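
To make the register layout concrete, here is a minimal sketch of how a 16-bit register could pack a 6-bit leading-zero count together with a 10-bit signature. The names `register`, `newRegister`, `lz`, and `sig`, and the choice of placing the count in the high bits, are assumptions made for illustration; they are not taken from the library's source.

```go
package main

import "fmt"

const (
	q = 6  // bits for the leading-zero count (the classic HyperLogLog register)
	r = 10 // additional bits for the b-bit signature
)

// register is a hypothetical 16-bit bucket: the high q bits hold the
// leading-zero count and the low r bits hold the signature.
type register uint16

// newRegister packs a leading-zero count and a signature into one register.
// Both inputs are masked so they cannot overflow their bit fields.
func newRegister(lz uint8, sig uint16) register {
	return register(uint16(lz&(1<<q-1))<<r | sig&(1<<r-1))
}

// lz extracts the leading-zero count from the high q bits.
func (reg register) lz() uint8 { return uint8(reg >> r) }

// sig extracts the signature from the low r bits.
func (reg register) sig() uint16 { return uint16(reg) & (1<<r - 1) }

func main() {
	reg := newRegister(5, 300)
	fmt.Println(reg.lz(), reg.sig()) // 5 300
}
```

With this particular layout, comparing two registers as plain integers orders them by leading-zero count first and by signature second, so a single comparison is enough to decide which of two registers is larger.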

    sk1 := hyperminhash.New()
    sk2 := hyperminhash.New()

    for i := 0; i < 10000; i++ {
        sk1.Add([]byte(strconv.Itoa(i)))
    }
    sk1.Cardinality() // 10001 (should be 10000)

    for i := 3333; i < 23333; i++ {
        sk2.Add([]byte(strconv.Itoa(i)))
    }
    sk2.Cardinality()     // 19977 (should be 20000)
    sk1.Similarity(sk2)   // 0.284589082 (should be 0.2857326533)
    sk1.Intersection(sk2) // 6623 (should be 6667)

    sk1.Merge(sk2)
    sk1.Cardinality() // 23271 (should be 23333)
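
The "should be" values in the comments can be reproduced with exact set arithmetic. The following is a small sketch that uses only the standard library and does not touch the sketch data structure at all; the helper names `buildSet` and `exactJaccard` are made up for this illustration.

```go
package main

import (
	"fmt"
	"strconv"
)

// buildSet returns the exact set of decimal strings for the integers in [lo, hi).
func buildSet(lo, hi int) map[string]struct{} {
	set := make(map[string]struct{}, hi-lo)
	for i := lo; i < hi; i++ {
		set[strconv.Itoa(i)] = struct{}{}
	}
	return set
}

// exactJaccard returns |A ∩ B|, |A ∪ B| and the Jaccard index |A ∩ B| / |A ∪ B|.
func exactJaccard(a, b map[string]struct{}) (inter, union int, jaccard float64) {
	for k := range a {
		if _, ok := b[k]; ok {
			inter++
		}
	}
	union = len(a) + len(b) - inter
	if union == 0 {
		return 0, 0, 0
	}
	return inter, union, float64(inter) / float64(union)
}

func main() {
	a := buildSet(0, 10000)    // same elements as sk1
	b := buildSet(3333, 23333) // same elements as sk2
	inter, union, j := exactJaccard(a, b)
	fmt.Println(len(a), len(b), inter, union) // 10000 20000 6667 23333
	fmt.Printf("jaccard = %.4f\n", j)         // jaccard = 0.2857
}
```

Multiplying the exact Jaccard index by the exact union cardinality gives back the intersection size (6667), which is the same relationship the sketch-based intersection estimate relies on.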

| Set1 | HMH1 | HLL1 | Set2 | HMH2 | HLL2 | S1 ∪ S2 | HMH1 ∪ HMH2 | HLL1 ∪ HLL2 | S1 ∩ S2 | HMH1 ∩ HMH2 | HLL1+HLL2-(HLL1∪HLL2) |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 27 | 27 | 27 | 361 | 363 | 361 | 374 | 376 | 374 | 14 (3.743316%) | 14 (3.723404%) | 14 (3.743316%) |
| 629 | 634 | 629 | 273 | 275 | 273 | 693 | 697 | 693 | 209 (30.158730%) | 211 (30.272597%) | 209 (30.158730%) |
| 705 | 709 | 705 | 642 | 645 | 642 | 708 | 712 | 708 | 639 (90.254237%) | 643 (90.308989%) | 639 (90.254237%) |
| 212 | 212 | 212 | 766 | 766 | 766 | 927 | 929 | 927 | 51 (5.501618%) | 51 (5.489774%) | 51 (5.501618%) |
| 966 | 969 | 966 | 799 | 797 | 798 | 1421 | 1426 | 1420 | 344 (24.208304%) | 346 (24.263675%) | 344 (24.225352%) |


| Set1 | HMH1 | HLL1 | Set2 | HMH2 | HLL2 | S1 ∪ S2 | HMH1 ∪ HMH2 | HLL1 ∪ HLL2 | S1 ∩ S2 | HMH1 ∩ HMH2 | HLL1+HLL2-(HLL1∪HLL2) |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1405 | 1410 | 1404 | 9929 | 9867 | 9929 | 11194 | 11152 | 11279 | 140 (1.250670%) | 154 (1.380918%) | 54 (0.478766%) |
| 9020 | 9024 | 9062 | 2827 | 2827 | 2827 | 10565 | 10559 | 10630 | 1282 (12.134406%) | 1327 (12.567478%) | 1259 (11.843838%) |
| 2297 | 2310 | 2296 | 3896 | 3868 | 3899 | 5557 | 5526 | 5574 | 636 (11.445024%) | 628 (11.364459%) | 621 (11.141012%) |
| 6015 | 5967 | 6055 | 621 | 616 | 621 | 6287 | 6242 | 6305 | 349 (5.551137%) | 335 (5.366870%) | 371 (5.884219%) |
| 4136 | 4123 | 4123 | 6006 | 5990 | 5975 | 10076 | 10080 | 10141 | 66 (0.655022%) | 65 (0.644841%) | 0 (0.000000%) |


| Set1 | HMH1 | HLL1 | Set2 | HMH2 | HLL2 | S1 ∪ S2 | HMH1 ∪ HMH2 | HLL1 ∪ HLL2 | S1 ∩ S2 | HMH1 ∩ HMH2 | HLL1+HLL2-(HLL1∪HLL2) |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 60687 | 59745 | 60356 | 98707 | 98199 | 99138 | 109123 | 108599 | 109211 | 50271 (46.068198%) | 49515 (45.594342%) | 50283 (46.042065%) |
| 3958 | 3944 | 3946 | 9674 | 9630 | 9688 | 13505 | 13460 | 13619 | 127 (0.940392%) | 132 (0.980684%) | 15 (0.110140%) |
| 67549 | 66446 | 67113 | 98052 | 97647 | 98513 | 133576 | 132744 | 133730 | 32025 (23.975115%) | 31448 (23.690713%) | 31896 (23.851043%) |
| 76325 | 75382 | 75954 | 20484 | 20366 | 20462 | 83842 | 83288 | 83875 | 12967 (15.465996%) | 13161 (15.801796%) | 12541 (14.952012%) |
| 71530 | 70369 | 71257 | 28544 | 28416 | 28585 | 88737 | 88209 | 88830 | 11337 (12.775956%) | 11198 (12.694850%) | 11012 (12.396713%) |


| Set1 | HMH1 | HLL1 | Set2 | HMH2 | HLL2 | S1 ∪ S2 | HMH1 ∪ HMH2 | HLL1 ∪ HLL2 | S1 ∩ S2 | HMH1 ∩ HMH2 | HLL1+HLL2-(HLL1∪HLL2) |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 142246 | 141517 | 142793 | 332708 | 329629 | 331308 | 356575 | 353214 | 353774 | 118379 (33.198906%) | 117967 (33.398167%) | 120327 (34.012392%) |
| 114564 | 113816 | 114643 | 389979 | 386335 | 387814 | 463990 | 458454 | 459780 | 40553 (8.740059%) | 41412 (9.032967%) | 42677 (9.282048%) |
| 505829 | 503076 | 501456 | 529941 | 532761 | 533367 | 891914 | 897115 | 889646 | 143856 (16.128909%) | 148236 (16.523634%) | 145177 (16.318513%) |
| 35997 | 35747 | 36232 | 600696 | 598130 | 596847 | 626071 | 625512 | 659381 | 10622 (1.696613%) | 10302 (1.646971%) | 0 (0.000000%) |
| 476717 | 472011 | 470168 | 830577 | 829483 | 829520 | 1125584 | 1128288 | 1119286 | 181710 (16.143620%) | 187679 (16.633962%) | 180402 (16.117596%) |


| Set1 | HMH1 | HLL1 | Set2 | HMH2 | HLL2 | S1 ∪ S2 | HMH1 ∪ HMH2 | HLL1 ∪ HLL2 | S1 ∩ S2 | HMH1 ∩ HMH2 | HLL1+HLL2-(HLL1∪HLL2) |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 847686 | 848830 | 843129 | 1580122 | 1567181 | 1565597 | 2005564 | 1999074 | 1996073 | 422244 (21.053629%) | 416263 (20.822791%) | 412653 (20.673242%) |
| 8543492 | 8537572 | 8468954 | 2118751 | 2132908 | 2105979 | 8706935 | 8713595 | 8637847 | 1955308 (22.456904%) | 1954912 (22.435195%) | 1937086 (22.425565%) |
| 6416419 | 6447411 | 6383130 | 6136246 | 6157177 | 6087945 | 6630774 | 6642118 | 6586601 | 5921891 (89.309197%) | 5930683 (89.289034%) | 5884474 (89.340071%) |
| 3115170 | 3098484 | 3125659 | 6531209 | 6538291 | 6486154 | 9087145 | 9084715 | 9062878 | 559234 (6.154122%) | 559115 (6.154458%) | 548935 (6.056961%) |
| 2796497 | 2773075 | 2811513 | 4481637 | 4456172 | 4506376 | 4567520 | 4541818 | 4596233 | 2710614 (59.345422%) | 2671681 (58.824044%) | 2721656 (59.214927%) |

Max Cardinality 100000000

| Set1 | HMH1 | HLL1 | Set2 | HMH2 | HLL2 | S1 ∪ S2 | HMH1 ∪ HMH2 | HLL1 ∪ HLL2 | S1 ∩ S2 | HMH1 ∩ HMH2 | HLL1+HLL2-(HLL1∪HLL2) |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 92677880 | 90564361 | 91316783 | 68555852 | 68194738 | 68095644 | 116860375 | 115203878 | 115336268 | 44373357 (37.971260%) | 42887226 (37.227242%) | 44076159 (38.215350%) |
| 40923468 | 41078856 | 40516294 | 40079654 | 39978480 | 39838181 | 47005537 | 47260934 | 46213830 | 33997585 (72.326767%) | 34182014 (72.326150%) | 34140645 (73.875385%) |
| 42896835 | 43112441 | 42366207 | 54500764 | 53829285 | 54259699 | 76119877 | 75033128 | 74855461 | 21277722 (27.952912%) | 21248656 (28.319033%) | 21770445 (29.083309%) |
| 87606825 | 85822057 | 86100193 | 73453201 | 74492482 | 73813694 | 150865349 | 150805169 | 149495773 | 10194677 (6.757468%) | 9987352 (6.622685%) | 10418114 (6.968835%) |
| 56331609 | 56528002 | 55351266 | 60043260 | 58790428 | 59532480 | 97016026 | 94933407 | 95451534 | 19358843 (19.954273%) | 18455020 (19.439964%) | 19432212 (20.358198%) |