wangyi-fudan / wyhash

The FASTEST QUALITY hash function, random number generators (PRNG) and hash map.

Streaming hash

smarchini opened this issue

From what I can tell, wyhash is designed with short strings in mind. When dealing with large strings or with any kind of streaming data, it is common practice to compute the hash incrementally by pausing the hash function and resuming from where you left off. This feature has been discussed in the past, for example in #28, #45 and #102. It used to be a thing too:

static inline void wyhash_update(wyhash_context_t *const __restrict ctx,
but it has been removed in the newer versions.

In what I think is the most recent thread about it (#102), @wangyi-fudan said wyhash is in very active development and maintaining a continuing hash interface is a bit of a burden. Nevertheless, it is entirely possible (with some caveats) for wyhash to exhibit pause & resume behavior. I have an implementation ready, so I'm opening this issue to discuss what we should do with it, because I think it can be useful to other people too.

Here's my implementation: wyhash_streaming.c. I'm already running a test to check that it computes the same value as the regular wyhash.

If you don't want to merge it into wyhash.h, I think it would be nice to have it documented somewhere (e.g., in an example file). Otherwise we can discuss what the exposed interface should look like, whether you like it as it is or want it to be different, and then I can open a PR.

Note the two caveats (1, 2):

  1. the chunk-size should be a multiple of 48, because wyhash handles short strings differently; and
  2. in the finalizing part you need to have data left to read, because wyhash uses > instead of >=.
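
To make the two caveats concrete, here is a small sketch that models only the control flow of the 48-byte loop (no hashing at all). The names `split_t` and `split_input` are made up for illustration and are not part of wyhash; the `inclusive` flag switches between the current `>` comparison and a hypothetical `>=`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of wyhash: it only counts how many 48-byte
   rounds the bulk loop would run for an input of `len` bytes, and how many
   bytes would be left over for the tail/finalization code. */
typedef struct {
    size_t rounds; /* full 48-byte rounds consumed by the loop */
    size_t tail;   /* bytes left for the finalizing part */
} split_t;

static split_t split_input(size_t len, int inclusive) {
    split_t s = {0, len};
    /* inclusive == 0 models `i > 48`, inclusive != 0 models `i >= 48` */
    if (inclusive ? (s.tail >= 48) : (s.tail > 48)) {
        do {
            s.rounds++;
            s.tail -= 48;
        } while (inclusive ? (s.tail >= 48) : (s.tail > 48));
    }
    assert(s.rounds * 48 + s.tail == len); /* nothing lost, nothing counted twice */
    return s;
}
```

With `>`, an input of 96 bytes gives one round and a 48-byte tail, so the last full block is finalized rather than digested; with `>=`, the same input gives two rounds and an empty tail. That is why a streaming wrapper around the `>` version has to keep some data back until it knows whether more is coming.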

@smarchini: I have tried that before, and there is a problem, which you mention only briefly in caveat 2. The last round (the finalizing part, as you call it) takes 1-48 bytes, and that upper bound of 48 is the problem: the digest step (wyhash_update()) has to be fed full rounds of 48 bytes, while the last 1-48 bytes have to be mixed in the final stage (wyhash_finalize()). In a streaming scenario you don't know whether the bytes you just received are the last ones, so when exactly 48 bytes arrive there is no way to tell whether they should go through a full round or through finalization. It would only work if every full round (here: 48 bytes) were always digested by wyhash_update() and never finalized. That is what you meant by wyhash using > instead of >=, in line 127. And yes, changing that comparison to >= would make it streamable:

--- wyhash.h~	2023-09-28 17:07:03.953024424 +0000
+++ wyhash.h	2023-09-28 17:35:22.079782532 +0000
@@ -124,14 +124,14 @@ static inline uint64_t wyhash(const void
   }
   else{
     size_t i=len; 
-    if(_unlikely_(i>48)){
+    if(_unlikely_(i>=48)){
       uint64_t see1=seed, see2=seed;
       do{
         seed=_wymix(_wyr8(p)^secret[1],_wyr8(p+8)^seed);
         see1=_wymix(_wyr8(p+16)^secret[2],_wyr8(p+24)^see1);
         see2=_wymix(_wyr8(p+32)^secret[3],_wyr8(p+40)^see2);
         p+=48; i-=48;
-      }while(_likely_(i>48));
+      }while(_likely_(i>=48));
       seed^=see1^see2;
     }
     while(_unlikely_(i>16)){  seed=_wymix(_wyr8(p)^secret[1],_wyr8(p+8)^seed);  i-=16; p+=16;  }
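
For comparison, if the comparison stays `>`, a streaming wrapper can still match the one-shot result by holding back a full 48-byte block until more data arrives, which is roughly what caveat 2 asks for. The sketch below shows only that buffering pattern. Everything in it (`toy_round`, `toy_final`, `toy_hash`, `toy_stream` and friends) is a hypothetical stand-in, deliberately not wyhash's mixing and not the code in wyhash_streaming.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 48

/* Toy stand-ins (NOT wyhash): a round for full blocks and a finalizer for the
   last 0..48 bytes. They differ on purpose, so it matters which one a block hits. */
static uint64_t toy_round(uint64_t state, const uint8_t *block) {
    for (size_t i = 0; i < BLOCK; i++) state = (state ^ block[i]) * 0x100000001b3ULL;
    return (state << 17) | (state >> 47);
}
static uint64_t toy_final(uint64_t state, const uint8_t *tail, size_t n, uint64_t total) {
    for (size_t i = 0; i < n; i++) state = (state ^ tail[i]) * 0x100000001b3ULL;
    return state ^ total;
}

/* One-shot reference with the strict `>` rule: the last block is always finalized. */
static uint64_t toy_hash(const void *data, size_t len, uint64_t seed) {
    const uint8_t *p = data;
    size_t i = len;
    while (i > BLOCK) { seed = toy_round(seed, p); p += BLOCK; i -= BLOCK; }
    return toy_final(seed, p, i, len);
}

typedef struct {
    uint64_t state;
    uint64_t total;     /* bytes seen so far */
    uint8_t buf[BLOCK]; /* held-back bytes, possibly a full block */
    size_t used;
} toy_stream;

static void toy_init(toy_stream *c, uint64_t seed) {
    c->state = seed; c->total = 0; c->used = 0;
}

static void toy_update(toy_stream *c, const void *data, size_t len) {
    const uint8_t *p = data;
    c->total += len;
    while (len > 0) {
        if (c->used == BLOCK) { /* more data arrived, so the held block is not the last */
            c->state = toy_round(c->state, c->buf);
            c->used = 0;
        }
        size_t n = BLOCK - c->used;
        if (n > len) n = len;
        memcpy(c->buf + c->used, p, n);
        c->used += n; p += n; len -= n;
    }
    assert(c->used <= BLOCK);
}

static uint64_t toy_digest(const toy_stream *c) {
    return toy_final(c->state, c->buf, c->used, c->total);
}
```

Feeding the same bytes through `toy_update()` in chunks of any size and then calling `toy_digest()` should give the same value as `toy_hash()` on the whole input, including inputs whose length is an exact multiple of 48. The cost is one extra block of buffering and a copy of every byte, which the `>=` change in the diff above would make unnecessary for chunk sizes that are multiples of 48.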