different result compared to numpy

Question

d5423197 opened this issue 2 years ago

Hello there,

I am trying to use this package to replace NumPy's histogram, but I get a different result.

I set the range to the minimum and the maximum of the input, but the result is missing the counts for the maximum value.

For example,

test_case = np.array([1, 1, 2, 2, 3, 3, 10, 10])
freq, bins = np.histogram(test_case, range(np.min(test_case), np.max(test_case + 1)))
result = histogram1d(test_case, bins=9, range=(np.min(test_case), np.max(test_case)))

Zhonghan Deng · Answer 1 · Thu Jan 19 2023 09:31:13 GMT+0800 (China Standard Time)

Is this repo still maintained?

Zhonghan Deng · Answer 2 · Thu Jan 19 2023 09:51:40 GMT+0800 (China Standard Time)

In my NumPy call below, bins is a sequence of 10 edges, so the returned hist has length 9. But for fast-histogram's histogram1d, setting bins=10 returns a hist of length 10, which looks inconsistent to me.

test_case = np.array([1, 1, 2, 2, 3, 3, 10, 10])
freq, bins = np.histogram(test_case, bins=range(np.min(test_case), np.max(test_case + 1)))
test = np.bincount(test_case, minlength=9)
result = histogram1d(test_case, bins=10, range=(np.min(test_case), np.max(test_case)))
result_1 = histogram1d(test_case, bins=9, range=(np.min(test_case), np.max(test_case) + 1))
result_2 = histogram1d(test_case, bins=10, range=(np.min(test_case), np.max(test_case) + 1))

I realized that fast-histogram treats the upper end of the range as exclusive, which is inconsistent with NumPy. Correct me if I am wrong.

I have tried many variations. result_2 is the closest one, but it has a length of 10.

I really want to replace NumPy's histogram with fast-histogram, but I need the same result.
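To make the difference concrete, here is a minimal sketch in plain NumPy that emulates a half-open range, where a value equal to the upper bound is dropped. The helper `half_open_hist` is hypothetical and written only for illustration. It is an assumption about the behavior described in this thread, not fast-histogram's actual implementation.

```python
import numpy as np


def half_open_hist(x, bins, value_range):
    """Hypothetical helper: equal-width bins over [xmin, xmax), upper bound excluded."""
    xmin, xmax = value_range
    x = np.asarray(x, dtype=float)
    # Keep only values inside the half-open interval; x == xmax is dropped.
    keep = (x >= xmin) & (x < xmax)
    idx = ((x[keep] - xmin) * bins / (xmax - xmin)).astype(int)
    # Guard against a floating-point round-up to `bins` just below xmax.
    idx = np.minimum(idx, bins - 1)
    return np.bincount(idx, minlength=bins)


test_case = np.array([1, 1, 2, 2, 3, 3, 10, 10])

# NumPy closes the last bin on the right, so the two 10s are counted.
numpy_hist, _ = np.histogram(test_case, bins=9, range=(1, 10))

# The half-open emulation drops the two 10s when the range ends at 10.
dropped = half_open_hist(test_case, bins=9, value_range=(1, 10))

# Widening the range to (1, 11) with 10 bins keeps them, at the cost of one extra bin.
widened = half_open_hist(test_case, bins=10, value_range=(1, 11))
```

Under this emulation, the narrower range loses the two maximum values, while the widened range keeps every value but returns 10 bins instead of 9, which matches the situation described for result_2.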

Thomas Robitaille · Answer 3 · Thu Jan 19 2023 17:11:59 GMT+0800 (China Standard Time)

Yes, this is still maintained. I will respond soon!

Thomas Robitaille · Answer 4 · Thu Jan 19 2023 18:43:39 GMT+0800 (China Standard Time)

@d5423197 if you are trying to bin integers, I highly recommend using np.bincount. What you are seeing here is a subtle difference between NumPy and fast-histogram: if a value is exactly equal to the upper bound of the range, it is not included by fast-histogram (this is for performance). If you prefer not to use np.bincount (which should be the fastest option if you really are binning integers), another option is to add a tiny value to the upper end of the range when calling fast-histogram, e.g., instead of binning from 0 to 10 you would bin from 0 to 10 + 1e-30 or similar. Does this make sense?
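As a rough sketch of the np.bincount suggestion, the snippet below counts the integers from the example and folds the count for the maximum into the previous bin, so the output has the same shape as the np.histogram call with integer edges. The variable names are illustrative only. The last two lines check a detail of the "tiny value" idea: in float64, 10 + 1e-30 rounds back to exactly 10.0, so a nudge would need to be at least as large as the one given by np.nextafter. How a nudged range treats values that sit exactly on interior bin edges is not examined here.

```python
import numpy as np

test_case = np.array([1, 1, 2, 2, 3, 3, 10, 10])
lo, hi = int(test_case.min()), int(test_case.max())

# One count per integer value lo, lo + 1, ..., hi (10 entries here).
per_value = np.bincount(test_case - lo, minlength=hi - lo + 1)

# NumPy reference with integer edges lo..hi: 9 bins, the last one is [hi - 1, hi].
freq, edges = np.histogram(test_case, bins=np.arange(lo, hi + 1))

# Fold the count for `hi` into the previous bin to mirror NumPy's closed last bin.
numpy_like = per_value[:-1].copy()
numpy_like[-1] += per_value[-1]

# A nudge to the upper bound has to survive float64 rounding.
too_small = 10 + 1e-30                 # evaluates to exactly 10.0
next_up = np.nextafter(10.0, np.inf)   # smallest float64 above 10.0
```

The per-value counts have length 10 and the folded version has length 9, which is the same length difference discussed earlier in the thread.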