astrofrog / fast-histogram

:zap: Fast 1D and 2D histogram functions in Python :zap:

different result compared to numpy

d5423197 opened this issue

Hello there,

I am trying to use this repo to replace numpy, but I get a different result.

I set range to the minimum and maximum of the input, but I found that the result is missing the values equal to the maximum.

For example,

test_case = np.array([1, 1, 2, 2, 3, 3, 10, 10])
freq, bins = np.histogram(test_case, range(np.min(test_case), np.max(test_case + 1)))
result = histogram1d(test_case, bins=9, range=(np.min(test_case), np.max(test_case)))

Is this repo still maintained?

With numpy's 1D histogram function, if you set bins to 10, the returned hist has a length of 9. But with fast-histogram's 1D function, if you set bins to 10, the returned hist has a length of 10, which is inconsistent.

test_case = np.array([1, 1, 2, 2, 3, 3, 10, 10])
freq, bins = np.histogram(test_case, bins=range(np.min(test_case), np.max(test_case + 1)))
test = np.bincount(test_case, minlength=9)
result = histogram1d(test_case, bins=10, range=(np.min(test_case), np.max(test_case)))
result_1 = histogram1d(test_case, bins=9, range=(np.min(test_case), np.max(test_case) + 1))
result_2 = histogram1d(test_case, bins=10, range=(np.min(test_case), np.max(test_case) + 1))
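The length mismatch described above may come from how `bins` is interpreted rather than from the two libraries disagreeing. The following is a minimal sketch using only NumPy: a sequence passed as `bins` is treated as bin edges (10 edges give 9 bins), while an integer is treated as the number of bins. The variable names `edges`, `freq_from_edges`, and `freq_from_count` are illustrative and not part of the original post.

```python
import numpy as np

test_case = np.array([1, 1, 2, 2, 3, 3, 10, 10])

# A sequence passed as `bins` is read as bin EDGES: 10 edges -> 9 bins.
edges = np.arange(test_case.min(), test_case.max() + 1)  # 1, 2, ..., 10
freq_from_edges, _ = np.histogram(test_case, bins=edges)

# An integer passed as `bins` is the NUMBER of bins: 10 -> 10 counts.
freq_from_count, _ = np.histogram(test_case, bins=10, range=(1, 10))

print(len(edges))            # 10
print(len(freq_from_edges))  # 9
print(len(freq_from_count))  # 10
```

So a like-for-like comparison would pass the same integer bin count and the same range to both functions.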

I realized that fast-histogram treats the upper bound of the range as exclusive, which is inconsistent with numpy. Correct me if I am wrong.

I have tried many approaches. result_2 is the closest one, but it has a length of 10.

I really want to replace numpy's histogram with fast-histogram, but I need the same result.
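To make the suspected difference concrete, here is a rough sketch that compares `np.histogram` (whose last bin includes its right edge) against a simplified model of half-open binning on `[lo, hi)`. The function `half_open_histogram` is hypothetical and written only for illustration; it is an assumption about the binning rule, not fast-histogram's actual code.

```python
import numpy as np


def half_open_histogram(x, bins, lo, hi):
    """Hypothetical model: equal-width bins on [lo, hi), upper bound excluded."""
    x = np.asarray(x, dtype=float)
    keep = (x >= lo) & (x < hi)  # values equal to `hi` are dropped
    idx = ((x[keep] - lo) * (bins / (hi - lo))).astype(int)
    idx = np.minimum(idx, bins - 1)  # guard against floating-point round-up
    return np.bincount(idx, minlength=bins)


test_case = np.array([1, 1, 2, 2, 3, 3, 10, 10])
lo, hi = test_case.min(), test_case.max()

# NumPy: the last bin is closed, so the two 10s are counted.
numpy_counts, _ = np.histogram(test_case, bins=9, range=(lo, hi))

# Half-open model: the two 10s sit exactly on the upper bound and are dropped.
model_counts = half_open_histogram(test_case, 9, lo, hi)

print(numpy_counts)  # [2 2 2 0 0 0 0 0 2]
print(model_counts)  # [2 2 2 0 0 0 0 0 0]
```

Under this model, the only difference is the count for values that sit exactly on the upper bound.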

Yes, this is still maintained. I will respond soon!

@d5423197 if you are trying to bin integers, I highly recommend using np.bincount. What you are seeing here is a subtle difference between Numpy and fast-histogram: if a value is exactly equal to the upper bound of the range, fast-histogram does not include it (this is for performance). If you prefer not to use np.bincount (which should be the fastest option if you really are binning integers), another option is to add a tiny value to the upper end of the range when calling fast-histogram, e.g., instead of binning from 0 to 10 you would bin from 0 to 10 + 1e-30 or similar. Does this make sense?
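Both suggestions can be sketched with NumPy alone. The first part shows `np.bincount` reproducing unit-width histogram counts for the integers in the example. The second part is a note on the "tiny value" idea: in float64, an offset as small as 1e-30 is absorbed when added to 10, so one possible way to nudge the bound is `np.nextafter`. The names `counts_from_min`, `reference`, and `nudged` are illustrative, and the final commented-out `histogram1d` call is an untested assumption about how the nudged bound would be used.

```python
import numpy as np

test_case = np.array([1, 1, 2, 2, 3, 3, 10, 10])

# np.bincount counts each non-negative integer from 0 up to the maximum.
counts = np.bincount(test_case)              # length max + 1 = 11
counts_from_min = counts[test_case.min():]   # drop the unused slot for 0

# Reference: unit-width bins [1, 2), [2, 3), ..., [10, 11] with NumPy.
reference, _ = np.histogram(test_case, bins=10, range=(1, 11))
print(np.array_equal(counts_from_min, reference))  # True

# Nudging the upper bound: 1e-30 is below float64 resolution near 10.
upper = float(test_case.max())
print(upper + 1e-30 == upper)                # True, the offset is lost

# np.nextafter returns the next representable float64 above `upper`.
nudged = np.nextafter(upper, np.inf)
print(nudged > upper)                        # True

# Assumed usage (not run here, requires fast-histogram):
# result = histogram1d(test_case, bins=9, range=(test_case.min(), nudged))
```

For integer data, the `np.bincount` route avoids the boundary question entirely, which is consistent with the recommendation above.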