Discrepancy in χ² result
ncoder-1 opened this issue
Hi!
I just wanted to start off by offering my appreciation for your Python port of ent. Your code style is excellent and very easy to understand. I am porting ent to C++ as a learning experiment, and your Python port has been a great help in understanding the math behind most of the functions.
I was wondering whether you had noticed any discrepancy between your port and the original ent when calculating χ². With large input files, the output matches across both of your versions (with and without numpy), the original ent, and my implementation. With small files, however, the χ² values no longer match.
For example, if I use your ent.py as an input file:
./ent_without_numpy.py ent.py produces:
χ² distribution for 7982 samples is 165337.59
./ent.py ent.py produces:
χ² distribution for 7982 samples is 169359.77
(original) ent ent.py produces:
Chi square distribution for 7982 samples is 170263.98
and my implementation gives:
χ² distribution for 7982 samples is 170263.981458
As I said, once the files get larger, the output matches across all four implementations. I was wondering whether you had seen this on your end.
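One possibility that may be worth ruling out (an assumption, not something checked against the source of either Python script): χ² sums over all 256 byte values, and a byte value that never occurs still contributes (0 - E)²/E = E, where E = samples/256. A small text file typically leaves many byte values unused, while a large file is more likely to contain all of them. The sketch below uses a made-up input and a hypothetical chi_square helper to show the effect of skipping absent values, then applies the same arithmetic to the figures quoted above.

```python
from collections import Counter


def chi_square(counts, total, bins=256):
    """Pearson chi-square against a uniform expectation of total/bins.

    Hypothetical helper: `counts` is any iterable of observed counts.
    Bins that are not listed are simply not summed.
    """
    expected = total / bins
    return sum((c - expected) ** 2 / expected for c in counts)


data = b"hello world, hello chi-square\n" * 20  # made-up small input
total = len(data)
observed = Counter(data)

# Variant 1: only byte values that occur (zero-count bins skipped).
present_only = chi_square(observed.values(), total)
# Variant 2: all 256 byte values, absent ones counted as 0.
all_bins = chi_square((observed.get(b, 0) for b in range(256)), total)

missing = 256 - len(observed)
e_small = total / 256
# Each skipped bin would have contributed (0 - E)^2 / E = E.
assert abs((all_bins - present_only) - missing * e_small) < 1e-6

# Same arithmetic on the figures quoted in this issue (7982 samples).
e = 7982 / 256
ratio_without_numpy = (170263.98 - 165337.59) / e  # ent vs ent_without_numpy.py
ratio_numpy = (170263.98 - 169359.77) / e          # ent vs ent.py
print(ratio_without_numpy, ratio_numpy)
```

If both printed ratios land very close to whole numbers, that would be consistent with some zero-count byte values being left out of the sum in the Python versions, though it does not prove it.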
If you're curious, here is my C++ implementation of the χ² calculation (not as pretty as yours):
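#include <cmath>    // std::pow
#include <numeric>  // std::accumulate

// Member function; char_map_ maps each byte value to its observed count.
auto ComputePearsonChiSquare() -> double {
  // Expected count per byte value: total number of samples / 256.
  const auto expected = std::accumulate(char_map_.begin(), char_map_.end(), 0.0,
      [](auto acc, const auto& element) { return acc + element.second; }) / 256.0;
  // Sum (observed - expected)^2 / expected over the entries of char_map_.
  return std::accumulate(char_map_.begin(), char_map_.end(), 0.0,
      [expected](auto acc, const auto& element) { return acc + std::pow(element.second - expected, 2) / expected; });
}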
with char_map_ being an unordered_map of the byte counts (your equivalent of the counts variable).
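For reference, the statistic this function computes can be written as follows, assuming char_map_ holds a count O_i for every byte value i (zero for values that never occur) and N is the total number of samples. The symbols O_i, E, and N are notation introduced here, not names from either code base.

```latex
E = \frac{N}{256}, \qquad
\chi^2 = \sum_{i=0}^{255} \frac{(O_i - E)^2}{E}
```

For the 7982-sample example above, E = 7982 / 256 = 31.1796875.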
At first blush, I cannot explain the difference. All the algorithms for χ² seem identical, and Python uses double internally for floating-point calculations.
So the only explanation I can come up with at this time is the inexact nature of binary floating point calculations, and maybe some (internal) rounding being done by Python/numpy.
When I replace float with decimal.Decimal in pearsonchisquare, I get exactly the same result as ent_without_numpy.
Using mpmath instead also yields the same result.
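The kind of check described here can be sketched as follows. This is a minimal illustration, not the code from ent_without_numpy: the chi_square helper and the input data are made up, and only decimal.Decimal from the standard library is used. It runs the same sum once with float and once with Decimal and measures how far apart the two results are.

```python
from collections import Counter
from decimal import Decimal


def chi_square(counts, total, number=float):
    """Chi-square over 256 byte values, with arithmetic done in `number`.

    Hypothetical helper: `counts` maps byte value -> observed count.
    """
    expected = number(total) / number(256)
    result = number(0)
    for byte in range(256):
        diff = number(counts.get(byte, 0)) - expected
        result += diff * diff / expected
    return result


data = bytes(range(100)) * 80  # made-up input: 8000 bytes, 100 distinct values
counts = Counter(data)

as_float = chi_square(counts, len(data), float)
as_decimal = chi_square(counts, len(data), Decimal)

# Decimal(as_float) converts the float exactly, so this is the
# rounding error accumulated by the float version.
error = abs(Decimal(as_float) - as_decimal)
print(as_decimal, error)
```

On an input of this size the float and Decimal results should agree to many decimal places, which matches the observation that switching to Decimal or mpmath does not change the output. Rounding error of that size would not account for differences in the thousands.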
Closing this for now, since I cannot find a cause.