apache / datasketches-java

A software library of stochastic streaming algorithms, a.k.a. sketches.

Home Page: https://datasketches.apache.org

cpc/hll - hardcoded hash function

jfz opened this issue

I noticed that both CpcSketch and HllSketch use MurmurHash3 hashing, and that it's hardcoded. In some cases, inputs may already have been hashed for other purposes, or a faster hash function may be preferred. It would be nice if there were a way to specify the hash function (constructor, setter, builder, etc.), with MurmurHash3 as the default.

The MurmurHash3 is hard-coded for several reasons.

  • Historical compatibility. Many users keep the binary images of these sketches for many years instead of keeping the raw data used to generate them, which would be many orders of magnitude larger. Changing the hash function would render all of that stored history useless. Keeping the same hash function means you can merge sketches generated years ago with sketches generated today.

  • In large companies, different departments may adopt sketches at different times and without knowledge of what other departments are doing. If the two departments happen to choose different hash functions, the sketches from the two departments can never be merged. If these two departments are merged together by the company for organizational reasons, you would have a real mess on your hands. This also applies to different companies that are brought together with a corporate merger.

  • We have done extensive accuracy testing of MurmurHash3 and of our sketches using MurmurHash3. Because of these characterization studies, we can state with confidence what the error properties and error distribution look like over a very wide range of inputs. If we allowed users to supply their own hash function, we could not make the same accuracy claims with the same confidence. Clearly, there are other good hash functions out there, but there are many lousy ones as well. Fully characterizing the error properties of these sketches with a specific hash function is hard work and takes a long time.

  • At the time we chose MurmurHash3, it was one of the fastest hash functions that also had excellent avalanche behavior, bit-independence and good Crush results. Since then, faster hash functions have appeared (e.g., xxHash), although only at 64 bits, while the MM3 hash we use is 128 bits. Nonetheless, the MM3 hash is quite fast: we have clocked it at about 5-6 ns. Rehashing a foreign hash is also quite fast, so already-hashed inputs should not be a problem. In our experience with large systems, the speed of updating the sketch, which uses the hash function, is not a major performance issue. Rather, it is merge time, which does not involve the hash function, that is of much greater importance. Thus, spending time and resources on improving update speed would not yield much benefit at the system level.
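To make the merge-compatibility argument concrete, here is a minimal sketch in plain Java. It is not the DataSketches implementation: `ToyKmv` is a hypothetical K-Minimum-Values counter with a pluggable hash, and `hashA`/`hashB` are a SplitMix64-style mixer with two different seeds, standing in for "two different hash functions". Two streams share half of their items, so the true distinct count of the union is 150,000.

```java
import java.util.TreeSet;
import java.util.function.LongUnaryOperator;

/** Toy illustration only; none of these names come from DataSketches. */
class HashMismatchDemo {

    /** Minimal K-Minimum-Values distinct counter with a pluggable hash. */
    static final class ToyKmv {
        private final int k;
        private final LongUnaryOperator hash;
        private final TreeSet<Long> smallest = new TreeSet<>();

        ToyKmv(int k, LongUnaryOperator hash) {
            this.k = k;
            this.hash = hash;
        }

        /** Hashes the item and keeps the value if it is among the k smallest. */
        void update(long item) {
            offer(hash.applyAsLong(item) >>> 1); // drop one bit so values are non-negative
        }

        /** Pools retained hash values; only meaningful if both sides used the same hash. */
        void merge(ToyKmv other) {
            for (long h : other.smallest) {
                offer(h);
            }
        }

        private void offer(long h) {
            smallest.add(h); // a duplicate hash value is ignored by the set
            if (smallest.size() > k) {
                smallest.pollLast();
            }
        }

        double estimate() {
            if (smallest.size() < k) {
                return smallest.size(); // fewer than k distinct values retained: count is exact
            }
            double kthFraction = smallest.last() / 0x1p63; // k-th smallest hash mapped to [0, 1)
            return (k - 1) / kthFraction;
        }
    }

    /** SplitMix64-style finalizer; the seed stands in for "a different hash function". */
    static long mix64(long x, long seed) {
        long z = x + seed;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    static long hashA(long x) { return mix64(x, 0x9E3779B97F4A7C15L); }
    static long hashB(long x) { return mix64(x, 0xC2B2AE3D27D4EB4FL); }

    /** Builds two half-overlapping sketches, merges them, and returns the estimate. */
    static double unionEstimate(LongUnaryOperator h1, LongUnaryOperator h2) {
        int n = 100_000;
        ToyKmv first = new ToyKmv(4096, h1);
        ToyKmv second = new ToyKmv(4096, h2);
        for (long i = 0; i < n; i++) {
            first.update(i);          // items 0 .. 99,999
            second.update(i + n / 2); // items 50,000 .. 149,999
        }
        first.merge(second);
        return first.estimate();
    }

    public static void main(String[] args) {
        System.out.printf("same hash:      %.0f%n",
            unionEstimate(HashMismatchDemo::hashA, HashMismatchDemo::hashA));
        System.out.printf("different hash: %.0f%n",
            unionEstimate(HashMismatchDemo::hashA, HashMismatchDemo::hashB));
    }
}
```

With a shared hash, the merged estimate should land near the true 150,000. With mismatched hashes, the 50,000 shared items produce unrelated hash values on each side, so the merge counts them twice and the estimate drifts toward 200,000. The same toy also hints at why rehashing is harmless: `update(long)` does not care whether its argument is a raw key or a value that some other hash function already produced.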

Lee.

Thank you for the context; fair points. I don't see it becoming a bottleneck for our use case yet, and it looks like MM3 works pretty well too. I'll circle back if I see a need to push it further.