astrofrog / fast-histogram

:zap: Fast 1D and 2D histogram functions in Python :zap:

histogram result has strange spikes as compared to numpy histogram

sahaskn opened this issue

Data file : https://ufile.io/cnj9l

Python Code:

import numpy as np
import matplotlib.pyplot as plt
from fast_histogram import histogram1d

# Load the data and histogram it with both libraries, using identical bins
data = np.load('x.npz')
h_np, _ = np.histogram(data['x'], bins=1100, range=[0, 1100])
h_fast = histogram1d(data['x'], bins=1100, range=[0, 1100])

# Overlay the two results (the last bin is left out of the plot)
plt.plot(h_np[:-1], 'r--', label='numpy')
plt.plot(h_fast[:-1], label='fast')
plt.legend()
plt.show()

The result is as follows:
[image: plot of the numpy and fast histogram results]

The spikes also appear with histogram2d.

@sahaskn - the issue is that the values you are reading in are integer-valued (they are float32, but they are round values such as 1.0). I think this is causing some non-deterministic behavior in cases where the bin edges line up exactly with the values. For instance, if you had values of 0.0, 1.0, and 2.0, and the histogram went from 0 to 2 with two bins, it's not clear which bin the value 1.0 should fall in. Could you try changing the number of bins to see if it's just an issue when bins is 1100? I can investigate how NumPy treats these edge cases to see if we can be more consistent with it.
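To illustrate why values that sit exactly on a bin edge are fragile, here is a minimal sketch. It assumes a uniform-bin index computed as floor((x - xmin) / width); the helper `bin_index` is hypothetical and is not taken from the fast_histogram or NumPy source. It only shows that a value on an edge and a value one floating-point step below it land in different bins.

```python
import numpy as np


def bin_index(x, xmin, xmax, nbins):
    """Hypothetical helper: uniform-bin index via floor((x - xmin) / width)."""
    width = (xmax - xmin) / nbins
    return int(np.floor((x - xmin) / width))


on_edge = 1.0
just_below = np.nextafter(1.0, 0.0)  # largest float64 strictly below 1.0

# Two bins covering [0, 2]: the shared edge is at exactly 1.0
idx_on_edge = bin_index(on_edge, 0.0, 2.0, 2)
idx_just_below = bin_index(just_below, 0.0, 2.0, 2)

print(idx_on_edge, idx_just_below)  # 1 0
```

Any rounding in how the index is computed can therefore move an edge-aligned value into the neighbouring bin, which is one plausible way to get spikes for integer-valued data.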

Just to add to my comment above, given the values you have, you can avoid non-deterministic effects by choosing range=[-0.5, 1100.5] and bins=1101 so that the values fall at the center of the bins.
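As a rough check of the centered-bin suggestion, the sketch below uses synthetic integer-valued float32 data (an assumption, since the original data file is not reproduced here) and NumPy only. With range=(-0.5, 1100.5) and bins=1101, every bin has width 1 and each integer value sits at a bin center, so the histogram should match a plain count of each integer.

```python
import numpy as np

# Synthetic stand-in for the data: integers 0..1100 stored as float32
rng = np.random.default_rng(0)
ints = rng.integers(0, 1101, size=10_000)
x = ints.astype(np.float32)

# Centered bins: edges at -0.5, 0.5, ..., 1100.5, so integers are bin centers
counts, edges = np.histogram(x, bins=1101, range=(-0.5, 1100.5))

# Reference: direct count of each integer value
expected = np.bincount(ints, minlength=1101)

print(np.array_equal(counts, expected))
```

The same bins and range arguments can be passed to histogram1d, as in the code at the top of this issue.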

@astrofrog, thanks for the reply. The NumPy documentation says that for every bin the lower edge is included and the upper edge is excluded, except for the last bin, where the upper edge is also included. I assumed fast_histogram works the same way, except that in the last bin the upper edge is not included.
Thanks for suggesting range=[-0.5, 1100.5] with bins=1101, which worked.
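The NumPy convention described above can be checked with the three-value example from earlier in the thread. This is a small sketch that uses only np.histogram; it says nothing about how fast_histogram treats the same values.

```python
import numpy as np

values = np.array([0.0, 1.0, 2.0])

# Two bins over [0, 2]: NumPy uses [0, 1) and [1, 2], the last bin being closed
counts, edges = np.histogram(values, bins=2, range=(0.0, 2.0))

print(edges)   # [0. 1. 2.]
print(counts)  # [1 2]
```

Here 1.0 goes into the second bin because lower edges are inclusive, and 2.0 is also counted there because the last bin includes its upper edge.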

Histogram2d is really fast!!!