rsmith-nl / ent

Python implementation of John Walker's ent program.

Discrepancy in χ² result

ncoder-1 opened this issue

Hi!

I wanted to start by thanking you for your Python port of ent. Your code style is excellent and very easy to follow. I am porting ent to C++ as a learning exercise, and your Python port has been a great help in understanding the math behind most of the functions.

I was wondering whether you had noticed any discrepancy between your port and the original ent when calculating χ². With large input files, the χ² output matches across both of your versions (with and without numpy), the original ent, and my implementation. With small files, however, the χ² values no longer agree.

For example, if I use your ent.py as the input file:

./ent_without_numpy.py ent.py produces:

χ² distribution for 7982 samples is 165337.59

./ent.py ent.py produces:

χ² distribution for 7982 samples is 169359.77

(original) ent ent.py produces:

Chi square distribution for 7982 samples is 170263.98

and my implementation gives:

χ² distribution for 7982 samples is 170263.981458

As I said, once the files get larger, the output matches across all four implementations. I was wondering whether you had seen this on your end.
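One quick sanity check on the numbers above, offered as a sketch rather than a diagnosis: under a uniform model each of the 256 byte values has an expected count of 7982 / 256, and a byte value that never occurs contributes exactly that amount to χ². The snippet below expresses the gap between each Python result and the original ent result in units of that expected count. The variable and function names are illustrative, and the only inputs are the figures quoted above.

```python
# Figures copied from the outputs quoted in this issue.
SAMPLES = 7982
REFERENCE = 170263.98  # original ent
REPORTED = {
    "ent_without_numpy.py": 165337.59,
    "ent.py": 169359.77,
}

# Expected count per byte value under a uniform distribution.
expected = SAMPLES / 256


def gap_in_bins(value, reference=REFERENCE):
    """Return the gap to the reference χ², in units of the expected count."""
    return (reference - value) / expected


gaps = {name: gap_in_bins(value) for name, value in REPORTED.items()}

for name, bins in gaps.items():
    print(f"{name}: {bins:.3f} expected counts below the reference")
```

Both gaps come out very close to whole numbers (about 158 and 29). That would be consistent with zero-count byte values being left out of the sum in the two Python versions, although nothing in this thread confirms that this is what happens.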

If you're curious, here is my C++ implementation of the χ² calculation (not as pretty as yours):

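#include <cmath>
#include <numeric>

// Pearson χ² over the byte counts stored in char_map_ (byte value -> count).
auto ComputePearsonChiSquare() -> double {
  // Expected count per byte value under a uniform distribution: total / 256.
  const auto expected = std::accumulate(char_map_.begin(), char_map_.end(), 0.0,
      [](auto acc, const auto& element) { return acc + element.second; }) / 256.0;

  // Sum of (observed - expected)² / expected; keys absent from char_map_ add nothing.
  return std::accumulate(char_map_.begin(), char_map_.end(), 0.0,
      [expected](auto acc, const auto& element) {
        return acc + std::pow(element.second - expected, 2) / expected;
      });
}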

Here char_map_ is a std::unordered_map from byte value to count (the equivalent of your counts variable).
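Because the χ² sum runs over whatever entries the count container holds, it matters whether byte values that never occur are stored with a count of zero. The following Python sketch is not taken from ent.py; `chi_square_dense` and `chi_square_sparse` are hypothetical names used only to show how the two conventions differ on a small, text-like input.

```python
from collections import Counter


def chi_square_dense(data: bytes) -> float:
    """Pearson χ² over all 256 byte values, including those with count 0."""
    expected = len(data) / 256
    counts = [0] * 256
    for byte in data:
        counts[byte] += 1
    return sum((c - expected) ** 2 / expected for c in counts)


def chi_square_sparse(data: bytes) -> float:
    """Same sum, but only over byte values that actually occur."""
    expected = len(data) / 256
    counts = Counter(data)
    return sum((c - expected) ** 2 / expected for c in counts.values())


sample = b"hello, world\n" * 64       # small input with few distinct bytes
expected = len(sample) / 256
missing = 256 - len(set(sample))      # byte values that never occur

dense = chi_square_dense(sample)
sparse = chi_square_sparse(sample)

# Each absent byte value contributes (0 - expected)² / expected = expected,
# so the two results differ by missing * expected.
print(dense - sparse, missing * expected)
```

For large files of random-looking data, every byte value tends to occur at least once, so the two conventions would give the same result. That would fit the observation that the discrepancy only shows up on small files, but it is an assumption about the cause, not something established here.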

At first glance, I cannot explain the difference. All the algorithms for χ² appear identical, and Python uses double precision internally for floating-point calculations.

So the only explanation I can come up with at this time is the inexact nature of binary floating-point arithmetic, possibly combined with some internal rounding done by Python or numpy.

When I replace float with decimal.Decimal in pearsonchisquare, I get exactly the same result as ent_without_numpy.
Using mpmath instead also yields the same result.
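A comparison of this kind can be sketched as follows, assuming a plain list of 256 counts. The function names are hypothetical and this is not the code from ent.py; the counts are synthetic, generated with a fixed seed to resemble a small text file of 7982 bytes.

```python
import random
from decimal import Decimal, getcontext


def chi_square_float(counts):
    """Pearson χ² using Python floats (IEEE 754 doubles)."""
    expected = sum(counts) / 256
    return sum((c - expected) ** 2 / expected for c in counts)


def chi_square_decimal(counts):
    """The same sum evaluated with 50 significant decimal digits."""
    getcontext().prec = 50
    expected = Decimal(sum(counts)) / Decimal(256)
    terms = ((Decimal(c) - expected) ** 2 / expected for c in counts)
    return sum(terms, Decimal(0))


# Synthetic counts: 7982 bytes drawn from a small alphabet.
random.seed(0)
counts = [0] * 256
for _ in range(7982):
    counts[random.choice(b"abcdefghij klmnop\n")] += 1

as_float = chi_square_float(counts)
as_decimal = chi_square_decimal(counts)
difference = abs(Decimal(as_float) - as_decimal)
print(as_float, as_decimal, difference)
```

On input like this, the float and Decimal results agree to many decimal places, which suggests that rounding in double arithmetic alone is far too small to produce gaps of hundreds or thousands in χ².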

Closing this for now, since I cannot find a cause.