HyperLogLog accuracy

Question

gingerlime opened this issue 4 years ago

I created gimel ~4 years ago and have been using it since, mostly for small-scale A/B tests. However, as our tests grew, it became apparent that results weren't always accurate, especially once participants reached tens of thousands or more. We noticed that our sample sizes "drifted" too far from each other, even though the randomized assignment in Alephbet worked correctly.

Investigating further, it became clear that the drift is due to the inherent inaccuracy of HyperLogLog (HLL). We ran some simulations and observed that above ~40,000 the counters started skewing for us. This is in line with the Redis implementation.
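To get a rough feel for the size of that drift, here is a minimal sketch based on the textbook HLL error estimate of 1.04 / sqrt(m) and the 16384 registers Redis uses, which gives the 0.81% standard error Redis documents. The function names are made up for illustration, and the formula is an asymptotic approximation, not the simulation we actually ran.

```python
import math

# Redis HLL uses 2^14 registers; the textbook standard error is 1.04 / sqrt(m).
REGISTERS = 16384


def hll_standard_error(registers: int = REGISTERS) -> float:
    """Approximate relative standard error of an HLL with `registers` registers."""
    return 1.04 / math.sqrt(registers)


def expected_drift(true_count: int, registers: int = REGISTERS) -> float:
    """One-standard-deviation absolute error for a counter holding `true_count` items."""
    return true_count * hll_standard_error(registers)


if __name__ == "__main__":
    for n in (1_000, 40_000, 1_000_000):
        print(f"{n:>9} participants: roughly +/- {expected_drift(n):.0f} (1 sigma)")
```

Since each variant has its own counter with its own independent error, the gap between two variants can be larger than the figure for a single counter.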

This means that smaller tests are still OK to run, but tests with larger sample sizes are hard to measure accurately.

The good news? I'm working on a fork of gimel called lamed, which should sidestep these issues. The new approach still uses Redis, but without HLL.

There's no free lunch though, so the new approach has different trade-offs:

It's not as space-efficient as gimel. You'll need more memory.
Duplicate UUIDs can only be detected within a specific time window (defaults to 24 hours). The larger the window, the more memory is needed; the smaller the window, the less.
Accuracy should be far higher than with HLL.
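To illustrate the trade-offs listed above, here is a small in-memory sketch of time-windowed duplicate detection. It is not lamed's actual code: the class name `WindowedUniqueCounter`, its methods and the injectable clock are all hypothetical, and a plain dict stands in for Redis.

```python
import time


class WindowedUniqueCounter:
    """Counts each UUID at most once per time window (hypothetical sketch)."""

    def __init__(self, window_seconds: float = 24 * 60 * 60, clock=time.time):
        self.window = window_seconds
        self.clock = clock      # injectable so the window can be tested
        self.seen = {}          # uuid -> timestamp at which the entry expires
        self.count = 0

    def track(self, uuid: str) -> bool:
        """Return True if the UUID was counted, False if it was a duplicate."""
        now = self.clock()
        expiry = self.seen.get(uuid)
        if expiry is not None and expiry > now:
            return False        # still inside the window: duplicate
        self.seen[uuid] = now + self.window
        self.count += 1
        return True

    def purge(self) -> None:
        """Drop expired entries; memory use is bounded by uniques per window."""
        now = self.clock()
        self.seen = {u: e for u, e in self.seen.items() if e > now}
```

Counting is exact inside the window, but a UUID that returns after its entry expires is counted again, and every UUID seen within the window takes up memory until it expires.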

Initial simulations with v4 UUIDs use ~150 MB of memory per 1 million unique track requests, so you can plan accordingly. For example, if you have 1 million track requests per day, you can keep memory below 150 MB by setting the time window to 24 hours. If you can afford more memory, you can increase the time window; if you want to save memory, you can reduce it.
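As a back-of-the-envelope planning aid, the arithmetic can be sketched like this. It assumes memory scales linearly with the number of unique track requests inside the window and that requests arrive at an even rate through the day. Both are simplifications, and the function name is made up for illustration.

```python
# ~150 MB per 1 million unique track requests, from the initial simulations.
MB_PER_MILLION_UNIQUES = 150.0


def estimate_memory_mb(unique_requests_per_day: float, window_hours: float = 24.0) -> float:
    """Rough memory estimate (MB) for a given daily volume and time window."""
    uniques_in_window = unique_requests_per_day * (window_hours / 24.0)
    return uniques_in_window / 1_000_000 * MB_PER_MILLION_UNIQUES


if __name__ == "__main__":
    for hours in (6, 12, 24, 48):
        mb = estimate_memory_mb(1_000_000, hours)
        print(f"{hours:>2}h window: ~{mb:.0f} MB")
```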

Please check out https://github.com/Alephbet/lamed. Feedback is welcome.