google / guava

Google core libraries for Java

Fast hash function, such as xxHash3

JohannesLichtenberger opened this issue:

Do you plan on implementing more hash functions? My use case is a temporal data store, which stores XML or JSON representations in a tree structure (and eventually in a custom binary format on disk). I optionally store a hash for each node, and the ancestor chain is always adapted through a rolling hash whenever a node is inserted, deleted, or modified. So far I've been using SHA-256 truncated to the first 128 bits, with BigInteger for some of the rolling-hash computations. The whole approach is admittedly naive: I originally thought that to reduce hash collisions I'd probably need a hardware-accelerated SHA-256, because I'm using the hashes on the one hand for a diff (which is probably not needed anymore, as I'm now doing change tracking between revisions), and on the other hand for a simple optimistic locking scheme: if a subtree changes between a read and a write, abort the transaction...
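For reference, the setup described above might look roughly like the following sketch (illustrative names only, not the data store's actual code); the per-node costs are the digest call plus the byte[] and BigInteger allocations:

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Illustrative sketch only: SHA-256 truncated to the first 128 bits,
// wrapped in a BigInteger as described above.
static BigInteger truncatedSha256(byte[] nodeBytes) throws NoSuchAlgorithmException {
  MessageDigest md = MessageDigest.getInstance("SHA-256");
  byte[] full = md.digest(nodeBytes);        // 32-byte digest
  byte[] first16 = Arrays.copyOf(full, 16);  // keep the first 128 bits
  return new BigInteger(1, first16);         // treat as non-negative
}
```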

So I guess, for my use cases, something much faster would be great, for instance xxHash3. Do you have any plans, or should I use another library for this? In general, I've read that Guava doesn't have the fastest hashing implementations, but I'm not sure if that's true. At least the @Beta annotation has been around for many years now.

Hey Johannes,

We have a small comparison of the hash functions available here.

tl;dr is that SHA256 is quite slow (it uses Java's MessageDigest under the hood), and not likely to be improved upon. Based on your notes, it sounds like you only need 128 bits, so perhaps murmur3_128() would work better for you? It should be about 3x faster than SHA256 according to our benchmarks.

Also, if you don't need stability across runs, consider just using goodFastHash(128) and we'll give you something "good and fast".
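To make the two suggestions concrete, here is a minimal sketch using Guava's HashFunction API; the node fields fed into the hasher (nodeKey, value) are illustrative:

```java
import com.google.common.hash.HashCode;
import com.google.common.hash.HashFunction;
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

class NodeHashing {
  // Stable across JVM runs and Guava releases.
  private static final HashFunction MURMUR_128 = Hashing.murmur3_128();

  // NOT stable across JVM runs; only safe for in-process hashes.
  private static final HashFunction FAST_128 = Hashing.goodFastHash(128);

  static HashCode hashNode(long nodeKey, String value) {
    return MURMUR_128.newHasher()
        .putLong(nodeKey)                          // illustrative node fields
        .putString(value, StandardCharsets.UTF_8)
        .hash();                                   // 128-bit HashCode
  }
}
```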

As for plans to implement additional hashing algorithms: we don't have anything on our radar. I'm by no means a hashing expert, but I haven't heard of xxHash3 before. I think we'd need some additional evidence that it would be broadly useful before we'd accept an additional hash function into the library.

HTH,
-kak

FWIW, I believe the biggest caveat for performance is that the API design results in allocations or similar waste. That's perfectly fine for most use cases an application programmer might have, but not ideal for performance-sensitive cases like java.util.HashMap. Generally, I start with Guava's hashing as a first pass while working out all of the other complex logic. If performance is a concern, profiling will quickly highlight whether this is the bottleneck, and if so I switch to an inline hash function. Hopefully that rule of thumb answers the question of whether Guava's hashing is a good choice for your project for now.
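As a sketch of the kind of "inline" fallback meant here (this is not Guava API, just the well-known Murmur3 64-bit finalizer applied directly to a long key), the whole hash can run allocation-free:

```java
// Allocation-free 64-bit mixer (Murmur3's fmix64 finalizer).
// A sketch of an inline replacement once profiling shows the
// Hasher allocations are the bottleneck.
static long mix64(long z) {
  z ^= z >>> 33;
  z *= 0xff51afd7ed558ccdL;
  z ^= z >>> 33;
  z *= 0xc4ceb9fe1a85ec53L;
  z ^= z >>> 33;
  return z;
}
```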

I guess I'll even switch to 64-bit hashes:

  • I've implemented simple change tracking between revisions in my data store, so diffing is usually not needed anymore, especially for comparisons between consecutive revisions. In any case, for diffing, if hashes are generated for all non-leaf nodes, I compare the two resources in preorder, starting from a given node. If hashes are built and they match, the whole subtrees of the nodes are skipped and considered equal (I also check the unique nodeKeys). However, if something in a subtree changed and the rolling hash of all ancestors is updated, a collision would make the algorithm produce a false result. With the switch to a 64-bit hash, collisions are going to happen at some point, as I'm storing nodes by a 64-bit identifier: the first 54 bits are used to compute the leaf page in which a node resides, and the last 10 bits the offset in the page (stored in a simple trie; see the bit-arithmetic sketch after this list).
  • In the second use case, for HTTP requests, I check whether a client fetched something via a query and afterwards makes an update. If, in between, another client changes something in the subtree of the fetched node, all ancestor node hashes are adapted, so the hashes hopefully no longer match, and a subsequent write from client 1 will fail if it updates or deletes the node.
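The node-key split mentioned in the first bullet boils down to simple bit arithmetic (a sketch under the stated 54/10 split; method names are illustrative):

```java
// Illustrative split of the 64-bit node key described above.
static long pageKey(long nodeKey) {
  return nodeKey >>> 10;           // upper 54 bits: leaf page
}

static int pageOffset(long nodeKey) {
  return (int) (nodeKey & 0x3FF);  // lower 10 bits: offset 0..1023
}
```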

So I'm not really sure, as in both cases hash collisions might matter, even though in the diffing case I'll still check at least the unique nodeKeys and maybe also the names/values of the nodes. But still, hmm. I've used BigInteger for the rolling-hash computations (after the real hash for new leaf nodes has been computed, to update the ancestor nodes), but BigInteger computations in a data store are super slow, so that was a rather bad idea, likely due to the allocations as well as the slow SHA-256 hash computations.
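One common way to make the ancestor update cheap (a hedged sketch, not necessarily how the data store above implements it) is to combine child hashes with an invertible operation such as XOR, so a changed child's stale contribution can be removed and the new one added in O(1) per ancestor:

```java
// Sketch: incremental rolling update of an ancestor's hash, assuming
// each node stores a 64-bit hash and child hashes are combined with
// XOR (invertible), so no subtree rehash is needed.
static long updateAncestor(long ancestorHash, long oldChildHash, long newChildHash) {
  return ancestorHash ^ oldChildHash ^ newChildHash;
}
```

Note that plain XOR is order-insensitive (reordered siblings hash identically) and duplicate subtrees cancel out, so a real scheme would likely mix each child's position or nodeKey into its contribution first.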

Hello,
I am new to open-source contribution.
I already know Java; my tech stack and tools include C, C++, Python, Java, JavaScript, HTML, CSS, SQL, Bootstrap, ReactJS, ExpressJS, NodeJS, and Git. I need a little help from your side to contribute to these amazing projects.

Please see https://github.com/google/guava/wiki/HowToContribute

A most excellent way to help would be to find a method that we aren't unit-testing very well yet, and write a better test for it.