Potential incorrect implementation

Question

Potential incorrect implementation

GuillaumeLeclerc opened this issue 7 years ago · comments

Hello,

I have this piece of code that compare the results of the js-xxhash implementation with the reference one.

const { hash: binaryProc } = require('xxhash')
const { h32: jsProc } = require('xxhashjs');
const data = 'this is some random piece of data';
const {UINT32: UINT} = require('cuint');
const random = require('random-buffer');

function hashBinary(data, seed) {
  if (typeof seed === 'string') {
    seed = Buffer.from(seed, 'hex');
  }
  if (typeof data === 'string') {
    data = Buffer.from(data, 'ascii');
  }
  return binaryProc(data, seed, 'hex')
}

function hashJS(data, seed) {
  seed = new UINT(0).fromString(seed, 16);
  return jsProc(data, seed).toString(16).match(/.{2}/g).reverse().join("");
}

let valid = 0;
let invalid = [];
const COUNT = 10000;
for (var i = 0 ; i < COUNT ; i++) {
  const buf = random(4);
  const seed = buf.toString('hex');
  const b = hashBinary(data, seed);
  const j =  hashJS(data, seed);
  if (b === j) {
    valid++;
  } else {
    invalid.push(new Uint8Array(buf.buffer))
  }
}
console.log(valid / COUNT * 100);

As we can see only 12.5% of the seeds actually return the same result with the two implementations.

I tried to figure it out and I might have a clue.

const { hash: binaryProc } = require('xxhash')
const { h32: jsProc } = require('xxhashjs');
const data = 'this is some random piece of data';
const {UINT32: UINT} = require('cuint');
const random = require('random-buffer');

function normalize_seed(seed) {
  for (var i = 0; i < seed.length; ++i) {
    seed[i] = seed[i] % 128;
  }
  return seed;
}

function hashBinary(data, seed) {
  if (typeof seed === 'string') {
    seed = Buffer.from(seed, 'hex');
  }
  if (typeof data === 'string') {
    data = Buffer.from(data, 'ascii');
  }
  return binaryProc(data, seed, 'hex')
}

function hashJS(data, seed) {
  seed = new UINT(0).fromString(seed, 16);
  return jsProc(data, seed).toString(16).match(/.{2}/g).reverse().join("");
}

let valid = 0;
let invalid = [];
const COUNT = 10000;
for (var i = 0 ; i < COUNT ; i++) {
  const buf = random(4);
  normalize_seed(buf);
  const seed = buf.toString('hex');
  const b = hashBinary(data, seed);
  const j =  hashJS(data, seed);
  if (b === j) {
    valid++;
  } else {
    invalid.push(new Uint8Array(buf.buffer))
  }
}
console.log(valid / COUNT * 100);

In this case I only use seeds where bytes are between 0 and 127. and the ratio of correct output reaches 93%. My guess is that the C implementation uses char (which are signed) and this one uses unsigned char and interpret the high bit of each byte differently.

However there must be another problem that makes the other 7% wrong.

Do you have any idea what is going on there ?

Pierre Curto · Answer 1 · Tue Jun 20 2017 02:02:31 GMT+0800 (China Standard Time)

Hmm this is interesting. I am not sure where the bug(s) lies: be it cuint or the js xxh code.
Unfortunately I am strapped on time right now so will have a look in a best effort way.

Julian Klug · Answer 2 · Mon Jul 17 2017 10:26:05 GMT+0800 (China Standard Time)

Same problem here

Guillaume Leclerc · Answer 3 · Mon Jul 24 2017 01:12:44 GMT+0800 (China Standard Time)

We know it is open source and I don't want you to feel any pressure (I know the feeling). But did you have time to look at it already ? Do you have any clue where it could come from so I could potentially start investigating ?

Pierre Curto · Answer 4 · Mon Jul 24 2017 20:10:11 GMT+0800 (China Standard Time)

I am sorry I havent had the time to look into this. I suspect the bug is in the cuint library.

Pierre Curto · Answer 5 · Mon Jul 31 2017 03:55:02 GMT+0800 (China Standard Time)

I have made some research on this and it appears I am getting wrong checksums not from js-xxhash but from xxhash itself.

I compared the results of js-xxhash, xxhash and the C reference implementation and got (picking random seeds):

seed	js-xxhash	xxhash	C xxhash
0xf00f85ee	8cb0299d	27a08d7c	8cb0299d
0xb1164e9f	116249b8	f9fbff2b	116249b8

Thoughts?

Pierre Curto · Answer 6 · Sun Aug 27 2017 16:31:21 GMT+0800 (China Standard Time)

Don't forget that the echo command appends a newline to the string. Taking this into account, I get:

var XXH = require('xxhashjs')

undefined

XXH.h32( 'testing123\n', 0 ).toString(16)

'21e4c81a'

2017-08-21 23:34 GMT+02:00 Jacob Loveless <notifications@github.com>:

…

In the above, I think xxhsum is defaulting the big endian (which might be the source of issue above?) But I'd note that I can't seem to get the xxhash-js to match either the reference xxhsum utility or a Golang version I also see a difference (I may be missing something on the node side). E.g. using the xxhash32 utility from Cyan's source (master) https://github.com/Cyan4973/xxHash/ This defaults to big endian and a seed of 0. ~/> echo "testing123" | ./xxh32sum 21e4c81a -. ~/> echo "testing123" | ./xxh32sum --little-endian 1ac8e421 -. But my node.js results differ: > var XXH = require('xxhashjs') undefined > XXH.h32( 'testing123', 0 ).toString(16) '153246a3' — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#14 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAWeZAgZaBWXMdkh3Jp9FGtyUlpPXmhNks5saffYgaJpZM4N9HaU> .

-- "The world today doesn't make any sense, why should I paint pictures that do?"—Picasso

Cluster Atlas · Answer 7 · Tue Oct 24 2017 22:53:20 GMT+0800 (China Standard Time)

Any updates on this? How does this affect the use of the library though?

Jacob Loveless · Answer 8 · Tue Oct 24 2017 23:03:44 GMT+0800 (China Standard Time)

This was an erroneous report on my part. This code is functioning correctly and matches all other implementations I've tested (C/Java/Go)

…

On Tue, Oct 24, 2017 at 10:53 AM, Cluster Atlas ***@***.***> wrote: Any updates on this? How does this affect the use of the library though? — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#14 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ATz-ctiTu_frEdnBn5pgYuagS4eRA3OZks5svfnhgaJpZM4N9HaU> .

Pierre Curto · Answer 9 · Thu Oct 26 2017 01:27:10 GMT+0800 (China Standard Time)

In which case I am closing this issue. Feel free to reopen one if needed.