janpipek / physt

Python histogram library - histograms as updateable, fully semantic objects with visualization tools. [P]ython [HYST]ograms.

approximate histograms

belm0 opened this issue

I'm following the paper (http://jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf) as implemented by https://github.com/carsonfarmer/streamhist, and the notion of approximate histograms seems elegant and efficient.

After seeing the internals of streamhist (while trying to fix bugs) and reading the paper, I can imagine ways to make a better implementation: e.g. much more efficient discovery of the bins to be joined, and avoiding temporary lists where possible. The code also seems overly complex, partly due to features like "bin freezing" that try to work around poor bin-joining performance.

Anyway, since streamhist is defunct, I'm thinking about attempting an implementation. I wonder whether this kind of histogram would fit into physt (and whether sortedcollections would be a reasonable dependency).

I've prototyped a new implementation of the Ben-Haim streaming histogram in pure Python, with no package dependencies. Given a max bin count of 64 (the streamhist default), adding a value takes 12 µs vs. 36 µs for streamhist (on a 5-year-old MacBook Air). That's compelling speed given that, at steady state, the algorithm dynamically adjusts bins on every insert. It's possible to maintain that insert speed at much larger max bin counts (say 1,000), but the implementation gets significantly more complex (it has to maintain a sorted container of bin distances).
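For concreteness, here's a minimal sketch of the insert-and-merge step described in the Ben-Haim/Tom-Tov paper (an illustration only, not my prototype; the class and attribute names are made up):

```python
import bisect

class StreamingHistogram:
    """Toy Ben-Haim/Tom-Tov streaming histogram: bins are (center, count) pairs."""

    def __init__(self, max_bins=64):
        self.max_bins = max_bins
        self.centers = []  # sorted bin centers
        self.counts = []   # matching bin counts

    def fill(self, value):
        # Insert the value as a bin of count 1, or bump an exact match.
        i = bisect.bisect_left(self.centers, value)
        if i < len(self.centers) and self.centers[i] == value:
            self.counts[i] += 1
            return
        self.centers.insert(i, value)
        self.counts.insert(i, 1)
        # At steady state, every insert is followed by merging the two
        # adjacent bins whose centers are closest together.
        if len(self.centers) > self.max_bins:
            j = min(range(len(self.centers) - 1),
                    key=lambda k: self.centers[k + 1] - self.centers[k])
            total = self.counts[j] + self.counts[j + 1]
            self.centers[j] = (self.centers[j] * self.counts[j] +
                               self.centers[j + 1] * self.counts[j + 1]) / total
            self.counts[j] = total
            del self.centers[j + 1]
            del self.counts[j + 1]
```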

More interesting is how flexible the algorithm is regarding bin merging strategies. The strategy in the Ben-Haim paper merges the nearest bins, which I suppose preserves the most information about a distribution. However, with a simple tweak it can instead maintain equal bin sizes, which is ideal for arbitrary percentile queries. Then, with some cheating, I've made a "best effort" equal-bin-size strategy that is amortized O(1) regardless of max bin count, with fill running at 3 µs. Other merge strategies are possible, such as providing specific percentiles using a minimum number of bins (e.g. accurate 50% and 90% using only 5 bins).
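To show how little changes between strategies, here's a sketch of the merge-pair selection (made-up names, not my prototype's API): the only piece that differs is the key used to pick which adjacent pair of bins to merge.

```python
def pick_merge_index(centers, counts, strategy="nearest"):
    """Choose which adjacent pair of bins (j, j+1) to merge next."""
    if strategy == "nearest":
        # Ben-Haim paper: merge the closest pair of centers, which keeps
        # the most information about the shape of the distribution.
        key = lambda j: centers[j + 1] - centers[j]
    elif strategy == "equal_count":
        # Merge the pair with the smallest combined count, pushing the
        # histogram toward equally populated bins (good for quantiles).
        key = lambda j: counts[j] + counts[j + 1]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return min(range(len(centers) - 1), key=key)
```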

implementation update

I improved the fill performance of the simple Python implementation and added a numba implementation:

  • 64 bins: streamhist 30 µs, new 6.5 µs, new_numba 2.3 µs
  • 1024 bins: streamhist 192 µs, new 31 µs, new_numba 4.7 µs
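For the curious, a rough sketch of what a numba fill step could look like, operating on preallocated arrays (again, made-up names and a simplification, not the actual code):

```python
import numpy as np
from numba import njit

@njit(cache=True)
def fill_one(centers, counts, n, value, max_bins):
    """Insert one value; merge the closest adjacent pair if over max_bins.
    Returns the new number of occupied bins."""
    # Find the insertion position, keeping centers sorted.
    pos = 0
    while pos < n and centers[pos] < value:
        pos += 1
    if pos < n and centers[pos] == value:
        counts[pos] += 1.0
        return n
    # Shift right to make room for the new bin.
    for i in range(n, pos, -1):
        centers[i] = centers[i - 1]
        counts[i] = counts[i - 1]
    centers[pos] = value
    counts[pos] = 1.0
    n += 1
    if n <= max_bins:
        return n
    # Merge the adjacent pair with the smallest gap between centers.
    best = 0
    for i in range(1, n - 1):
        if centers[i + 1] - centers[i] < centers[best + 1] - centers[best]:
            best = i
    total = counts[best] + counts[best + 1]
    centers[best] = (centers[best] * counts[best] +
                     centers[best + 1] * counts[best + 1]) / total
    counts[best] = total
    for i in range(best + 1, n - 1):
        centers[i] = centers[i + 1]
        counts[i] = counts[i + 1]
    return n - 1

# Usage: preallocate one slot of slack for the transient extra bin.
max_bins = 64
centers = np.zeros(max_bins + 1)
counts = np.zeros(max_bins + 1)
n = 0
for x in np.random.normal(size=10_000):
    n = fill_one(centers, counts, n, x, max_bins)
```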

(I wonder if physt would take on numba via extras_require.)

I've implemented an equivalent of numpy.quantile(). It matches numpy.quantile() output exactly until max_bins is reached, then transitions gracefully to approximate quantiles.
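Roughly, a quantile can be read off by interpolating cumulative counts at the bin centers; a sketch of the idea only (not the actual implementation; the paper's own procedure uses trapezoid interpolation within bins):

```python
import numpy as np

def approx_quantile(centers, counts, q):
    """Estimate the q-quantile (0 <= q <= 1) from (center, count) bins."""
    centers = np.asarray(centers, dtype=float)
    counts = np.asarray(counts, dtype=float)
    # Cumulative count at each center, assuming half of each bin's
    # population lies below its center.
    cum = np.cumsum(counts) - counts / 2.0
    return float(np.interp(q * counts.sum(), cum, centers))
```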

Sorry, this will probably never be implemented :-/