choice of returned histogram datatype

Question

choice of returned histogram datatype

manodeep opened this issue 7 years ago · comments

Thanks for writing this fast histogram package. A task that we astronomers tend to do often and repeatedly :)

Coming back to the package itself, it seems to me that the return datatype of the histograms are doubles. Conceptually, the histogram is simply a count in that bin, and should be of an integral type. Is the return datatype chosen to be double to reflect the input array type? Or is that in keeping with the numpy convention?

Changing the histogram increment to be an integral type (i.e., the postfix ++ operator) might also lead to better performance. (There are some opportunity for performance improvements -- e.g., pointer operations, reduction of scope for variables, unrolled loops + separate counters)

Separately, any particular reason to perform a bitwise and (&) rather than the logical and (&&) here?

Thomas Robitaille · Answer 1 · Wed Jul 19 2017 15:47:48 GMT+0800 (China Standard Time)

I decided to return doubles for consistency with Numpy, but we could potentially have an option to choose the return type (at least maybe choosing between double and long?). All the other improvements sound good, and the use of && was a mistake :)

Would you be interested in submitting a pull request to implement/fix some of these as well as #4?

You mention reduction of scope for variables, but just to let you know, I found that all the variables had to be declared at the top to keep the Windows compilers happy (but maybe you know how to avoid this).

Manodeep Sinha · Answer 2 · Wed Jul 19 2017 16:37:18 GMT+0800 (China Standard Time)

I sent up the trivial stuff in the PR #6.

I have never compiled for Windows. What's the error you get on Windows with these lines:

const double postfac = normx * fnx;
for(i = 0; i < n; i++) {
    if (*x >= xmin && *x < xmax) {
        const long ix = (*x - xmin) * postfac;
        count[ix] += 1.;
    }
   x++;
}

Thomas Robitaille · Answer 3 · Wed Jul 19 2017 17:07:26 GMT+0800 (China Standard Time)

@manodeep - Thanks! I don't have Windows locally, but you can try and open a PR, and there will be an AppVeyor build in which you can check any error messages if the build fails.

Henry Schreiner · Answer 4 · Thu Nov 01 2018 20:28:10 GMT+0800 (China Standard Time)

FYI, item += 1 stops working if the numbers get large enough for floating point numbers. However, that number is very large for doubles.

Henry Schreiner · Answer 5 · Fri Nov 02 2018 07:57:53 GMT+0800 (China Standard Time)

Also, numpy returns int64s:

import numpy as np
a,b = np.histogram([1,2,.3])
print(a.dtype)
# prints int64

Thomas Robitaille · Answer 6 · Fri Nov 02 2018 21:46:04 GMT+0800 (China Standard Time)

I wonder if this changed in Numpy since I wrote this package. I'd certainly be up for returning integer values from the plain histogram and doubles from the weighted one, though I need to check if that might have any undesired side effects on other packages.

Henry Schreiner · Answer 7 · Sat Nov 03 2018 02:56:19 GMT+0800 (China Standard Time)

I think this changed in 1.10, at least the 64 part of the int was added then. This is also when numpy got a high-speed 1D histogram algorithm. I wonder how it compares to fast-histogram now in 1D?

numpy/numpy@34b582a#diff-9d44603693b9544b900e135a203b3ae2R180, numpy/numpy#7629

Edit, per this issue numpy/numpy#7845, 2D/ND histograms still return a floating type.

Thomas Robitaille · Answer 8 · Sat Nov 03 2018 04:23:20 GMT+0800 (China Standard Time)

The current benchmarks shown in the README are against a later version than 1.10, so this is still quite a bit faster than numpy. I'm not sure if you are familiar with the asv tool for benchmarks, but I keep a running benchmark here of how fast fast-histogram is compared to Numpy: https://astrofrog.github.io/fast-histogram/#benchmarks.Histogram1D.time_histogram1d?x-axis=commit

Henry Schreiner · Answer 9 · Mon Nov 05 2018 02:42:33 GMT+0800 (China Standard Time)

By the way, the current Numpy code is internally in int64s, and then casts at the end. There's one line that can be dropped to stop the casting. This allows Numpy to count past the point that floating point can no longer add 1.