astrofrog / fast-histogram

:zap: Fast 1D and 2D histogram functions in Python :zap:

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Bug: Histogram2d result does not match that of the numpy function

jonathanishhorowicz opened this issue · comments

Thank you for this package, it is indeed very fast and I am looking forward to using it.

There is a discrepancy between your histogram2d function and the numpy equivalent, which is demonstrated below:

import numpy as np
from fast_histogram import histogram2d as fast_hist_2d

n_bins = 5	

x = np.array([3,2,5,4,6,2,3,5,3,2,5,3,6,7,3])
y = np.array([1,6,3,2,5,3,1,6,4,6,4,2,4,5,3])

xmin, xmax = np.min(x), np.max(x)
ymin, ymax = np.min(y), np.max(y)

h_fast = fast_hist_2d(x,y,
	range=[[xmin,xmax], [ymin,ymax]],
	bins = [n_bins,n_bins])


h_np = np.histogram2d(x,y,
	range=[[xmin,xmax], [ymin,ymax]],
	bins = [n_bins,n_bins])[0]

print "h_fast =\n", h_fast
print "\nh_np =\n", h_np

On my machine this outputs

h_fast =
[[ 0. 0. 1. 0. 0.]
[ 2. 1. 1. 1. 0.]
[ 0. 1. 0. 0. 0.]
[ 0. 0. 1. 1. 0.]
[ 0. 0. 0. 1. 1.]]

h_np =
[[ 0. 0. 1. 0. 2.]
[ 2. 1. 1. 1. 0.]
[ 0. 1. 0. 0. 0.]
[ 0. 0. 1. 1. 1.]
[ 0. 0. 0. 1. 2.]]

The fast_histogram result only includes 11 of the 15 points. The 4 missing points are all from the final column

Thanks for reporting this - I think this is a literal 'edge' case where I think we must be treating points that lie on the exact edge differently, but I agree it'd probably be good to make them consistent.

The relevant line is here: https://github.com/astrofrog/fast-histogram/blob/master/fast_histogram/_histogram_core.c#L186 - specifically here we use < instead of <= for the upper bound, otherwise the index calculation returns an out of bounds index. We could change < to <= and then special case the case where the index needs to be reduced by 1, though this would add an additional 'if' statement in the loop which could further slow things down. I'll try and investigate early next week, but in the mean time the solution is to not use exactly the same value for xmax/ymax as some of the values in the arrays being binned (for example one could add a tiny value to each of the upper limits)

Yes, adding a small value to the upper limits for x and y does solve the problem for this example (and many others that I tested).

Thank you for your quick response!

Hi,

One comment: your problem of discrepancies seems to be solved if you add a small value, for example, 0.01 to range that is shared by both np.histogram and the fast_histogram. For example as in,

xmin, xmax = np.min(x), np.max(x) + 0.01 
ymin, ymax = np.min(y), np.max(y) + 0.01

This is shared by both the functions. However, I was trying to test by not modifying anything in the numpy side, that is by doing:

h_fast = fast_hist_2d(x,y,
	range=[[xmin, xmax + 0.01], [ymin, ymax + 0.01]],
	bins = [n_bins,n_bins])          

h_np = np.histogram2d(x,y,
	range=[[xmin,xmax], [ymin,ymax]],
	bins = [n_bins,n_bins])[0] 

The result is not the same as shown below. Why do we have to pass the modified range to np.histogram as well?

h_fast =
[[ 3. 2. 1. 0. 2.]
[ 1. 0. 0. 0. 0.]
[ 0. 1. 1. 0. 1.]
[ 0. 0. 1. 1. 0.]
[ 0. 0. 0. 1. 0.]]

h_np =
[[ 0. 0. 1. 0. 2.]
[ 2. 1. 1. 1. 0.]
[ 0. 1. 0. 0. 0.]
[ 0. 0. 1. 1. 1.]
[ 0. 0. 0. 1. 2.]]

Although increasing the range can act as a quick fix to the problem it doesn't really fix the problem.

Changing the range will affect the calculated edges (bin intervals), so numpy and fast_histogram will, in those examples return the same counts for each bin but with different edges (even though fast_histogram doesn't return the edges).
Depending on how much you increase the range and on the precision of your values this may still alter the bin counts.

I believe that this issue can be closed now. We have decided to purposefully break with numpy on this particular point because the additional if statement to check for the edge case slows everything down by 20%. For the vast majority of use cases, the additional performance will be more appreciated than the adherence to numpy's convention.

I agree! Before we close this I think we might want to add an entry to the FAQ.