Bug: Histogram2d result does not match that of the numpy function

Question

Bug: Histogram2d result does not match that of the numpy function

jonathanishhorowicz opened this issue 7 years ago · comments

Jonathan Ish-Horowicz commented 7 years ago

Thank you for this package, it is indeed very fast and I am looking forward to using it.

There is a discrepancy between your histogram2d function and the numpy equivalent, which is demonstrated below:

import numpy as np
from fast_histogram import histogram2d as fast_hist_2d

n_bins = 5	

x = np.array([3,2,5,4,6,2,3,5,3,2,5,3,6,7,3])
y = np.array([1,6,3,2,5,3,1,6,4,6,4,2,4,5,3])

xmin, xmax = np.min(x), np.max(x)
ymin, ymax = np.min(y), np.max(y)

h_fast = fast_hist_2d(x,y,
	range=[[xmin,xmax], [ymin,ymax]],
	bins = [n_bins,n_bins])


h_np = np.histogram2d(x,y,
	range=[[xmin,xmax], [ymin,ymax]],
	bins = [n_bins,n_bins])[0]

print "h_fast =\n", h_fast
print "\nh_np =\n", h_np

On my machine this outputs

h_fast =
[[ 0. 0. 1. 0. 0.]
[ 2. 1. 1. 1. 0.]
[ 0. 1. 0. 0. 0.]
[ 0. 0. 1. 1. 0.]
[ 0. 0. 0. 1. 1.]]

h_np =
[[ 0. 0. 1. 0. 2.]
[ 2. 1. 1. 1. 0.]
[ 0. 1. 0. 0. 0.]
[ 0. 0. 1. 1. 1.]
[ 0. 0. 0. 1. 2.]]

The fast_histogram result only includes 11 of the 15 points. The 4 missing points are all from the final column

Thomas Robitaille · Answer 1 · Sat Aug 05 2017 04:46:19 GMT+0800 (China Standard Time)

Thanks for reporting this - I think this is a literal 'edge' case where I think we must be treating points that lie on the exact edge differently, but I agree it'd probably be good to make them consistent.

Thomas Robitaille · Answer 2 · Sat Aug 05 2017 04:51:50 GMT+0800 (China Standard Time)

The relevant line is here: https://github.com/astrofrog/fast-histogram/blob/master/fast_histogram/_histogram_core.c#L186 - specifically here we use < instead of <= for the upper bound, otherwise the index calculation returns an out of bounds index. We could change < to <= and then special case the case where the index needs to be reduced by 1, though this would add an additional 'if' statement in the loop which could further slow things down. I'll try and investigate early next week, but in the mean time the solution is to not use exactly the same value for xmax/ymax as some of the values in the arrays being binned (for example one could add a tiny value to each of the upper limits)

Jonathan Ish-Horowicz · Answer 3 · Mon Aug 07 2017 00:09:24 GMT+0800 (China Standard Time)

Yes, adding a small value to the upper limits for x and y does solve the problem for this example (and many others that I tested).

Thank you for your quick response!

Razib Obaid · Answer 4 · Tue Oct 31 2017 04:32:50 GMT+0800 (China Standard Time)

Hi,

One comment: your problem of discrepancies seems to be solved if you add a small value, for example, 0.01 to range that is shared by both np.histogram and the fast_histogram. For example as in,

xmin, xmax = np.min(x), np.max(x) + 0.01 
ymin, ymax = np.min(y), np.max(y) + 0.01

This is shared by both the functions. However, I was trying to test by not modifying anything in the numpy side, that is by doing:

h_fast = fast_hist_2d(x,y,
	range=[[xmin, xmax + 0.01], [ymin, ymax + 0.01]],
	bins = [n_bins,n_bins])          

h_np = np.histogram2d(x,y,
	range=[[xmin,xmax], [ymin,ymax]],
	bins = [n_bins,n_bins])[0]

The result is not the same as shown below. Why do we have to pass the modified range to np.histogram as well?

h_fast =
[[ 3. 2. 1. 0. 2.]
[ 1. 0. 0. 0. 0.]
[ 0. 1. 1. 0. 1.]
[ 0. 0. 1. 1. 0.]
[ 0. 0. 0. 1. 0.]]

h_np =
[[ 0. 0. 1. 0. 2.]
[ 2. 1. 1. 1. 0.]
[ 0. 1. 0. 0. 0.]
[ 0. 0. 1. 1. 1.]
[ 0. 0. 0. 1. 2.]]

Nelson · Answer 5 · Wed Mar 13 2019 21:57:41 GMT+0800 (China Standard Time)

Although increasing the range can act as a quick fix to the problem it doesn't really fix the problem.

Changing the range will affect the calculated edges (bin intervals), so numpy and fast_histogram will, in those examples return the same counts for each bin but with different edges (even though fast_histogram doesn't return the edges).
Depending on how much you increase the range and on the precision of your values this may still alter the bin counts.

Alexandra Trettin · Answer 6 · Thu Sep 30 2021 05:00:51 GMT+0800 (China Standard Time)

I believe that this issue can be closed now. We have decided to purposefully break with numpy on this particular point because the additional if statement to check for the edge case slows everything down by 20%. For the vast majority of use cases, the additional performance will be more appreciated than the adherence to numpy's convention.

Thomas Robitaille · Answer 7 · Thu Sep 30 2021 05:28:47 GMT+0800 (China Standard Time)

I agree! Before we close this I think we might want to add an entry to the FAQ.