Streaming histograms
An implementation of the streaming histograms algorithm as described in A Streaming Parallel Decision Tree Algorithm by Yael Ben-Haim and Elad Tom-Tov (2010).
The streaming histogram is defined in terms of bins $(p_i, m_i)$, where $p_i$ is the mean of the samples falling into the $i$-th bin and $m_i$ is their count. The histogram is updated by treating each newly arriving datapoint $x$ as a new bin $(x, 1)$; when this exceeds the allowed number of bins, the two bins with the closest means, $(p_i, m_i)$ and $(p_{i+1}, m_{i+1})$, are combined into

$$\left( \frac{p_i m_i + p_{i+1} m_{i+1}}{m_i + m_{i+1}},\ m_i + m_{i+1} \right),$$

which is taken as a new bin replacing them. Histograms can also be resized or merged with each other by repeatedly merging the closest bins.
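The insert-then-merge update described above can be sketched as follows. This is a minimal illustration of the technique, not the crate's actual internals; the `Bin` struct and `insert` function here are hypothetical names.

```rust
// Sketch of the streaming-histogram update: each bin is a (mean, count) pair.
// A new point enters as a unit-count bin; if the histogram then exceeds its
// capacity, the two bins with the closest means are merged at their weighted mean.

#[derive(Debug, Clone, Copy)]
struct Bin {
    mean: f64,
    count: f64,
}

fn insert(bins: &mut Vec<Bin>, value: f64, capacity: usize) {
    // add the new point as a unit-weight bin, keeping bins sorted by mean
    let pos = bins.partition_point(|b| b.mean < value);
    bins.insert(pos, Bin { mean: value, count: 1.0 });

    if bins.len() > capacity {
        // find the adjacent pair with the smallest gap between means
        let i = (0..bins.len() - 1)
            .min_by(|&a, &b| {
                let ga = bins[a + 1].mean - bins[a].mean;
                let gb = bins[b + 1].mean - bins[b].mean;
                ga.partial_cmp(&gb).unwrap()
            })
            .unwrap();
        // replace the pair with a single bin at their weighted mean
        let (l, r) = (bins[i], bins[i + 1]);
        let count = l.count + r.count;
        bins[i] = Bin {
            mean: (l.mean * l.count + r.mean * r.count) / count,
            count,
        };
        bins.remove(i + 1);
    }
}

fn main() {
    let mut bins = Vec::new();
    for &x in &[1.0, 2.0, 2.1, 5.0, 1.1] {
        insert(&mut bins, x, 3);
    }
    for b in &bins {
        println!("mean = {:.3}, count = {}", b.mean, b.count);
    }
}
```

Because bins are kept sorted by mean, finding the closest pair is a single linear scan over adjacent bins.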
Statistics
The weighted mean and variance of such bins can be used to approximate the sample mean and variance. Yael Ben-Haim and Elad Tom-Tov (2010) also describe algorithms for approximating the sample quantiles and the empirical cumulative distribution function by applying the trapezoidal rule to interpolate between the bins.
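For example, the weighted mean and variance mentioned above can be computed directly from the `(mean, count)` bins. This is a sketch under that bin representation, not the crate's API; note that the within-bin spread is lost, so the variance is approximated from the bin means alone.

```rust
// Approximate the sample mean and variance from (mean, count) bins:
// each bin contributes its mean, weighted by its count.

fn mean_and_variance(bins: &[(f64, f64)]) -> (f64, f64) {
    let total: f64 = bins.iter().map(|&(_, k)| k).sum();
    let mean: f64 = bins.iter().map(|&(m, k)| m * k).sum::<f64>() / total;
    // between-bin (weighted) variance; within-bin spread is not recoverable
    let var: f64 = bins
        .iter()
        .map(|&(m, k)| k * (m - mean).powi(2))
        .sum::<f64>()
        / total;
    (mean, var)
}

fn main() {
    let bins = [(1.0, 2.0), (3.0, 1.0), (6.0, 1.0)];
    let (mean, var) = mean_and_variance(&bins);
    println!("mean = {mean}, variance = {var}");
}
```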
Kernel density estimation
Additionally, a weighted kernel density estimator may be used for approximating the probability density function of the data. The estimator is defined as

$$\hat f(x) = \sum_i w_i \, K_h(x - p_i),$$

with weights

$$w_i = \frac{m_i}{\sum_j m_j},$$

where $K_h$ is a kernel function with bandwidth $h$.
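The weighted estimator can be sketched as below. The Gaussian kernel and the fixed bandwidth here are assumptions for illustration; the kernel and bandwidth selection used by histr itself may differ.

```rust
// Weighted kernel density estimate over (mean, count) bins:
// f(x) = sum_i w_i * K_h(x - p_i), with w_i = m_i / sum_j m_j.
// A Gaussian kernel with a user-supplied bandwidth is assumed here.

use std::f64::consts::PI;

fn gaussian_kernel(x: f64, h: f64) -> f64 {
    (-0.5 * (x / h).powi(2)).exp() / (h * (2.0 * PI).sqrt())
}

fn density(bins: &[(f64, f64)], bandwidth: f64, x: f64) -> f64 {
    let total: f64 = bins.iter().map(|&(_, m)| m).sum();
    bins.iter()
        .map(|&(p, m)| (m / total) * gaussian_kernel(x - p, bandwidth))
        .sum()
}

fn main() {
    let bins = [(1.0, 3.0), (2.0, 1.0)];
    println!("f(1.5) = {}", density(&bins, 0.5, 1.5));
}
```

Since the weights sum to one and each kernel integrates to one, the resulting estimate is itself a proper density.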
Command line interface
For example, the following command pipes the tab-separated file (skipping the header line with tail -n +2) to histr. The histogram is saved to a file (-o hist.msgpack) and printed. The saved histogram could then be read again (with -l hist.msgpack) and be updated with new data.
$ tail -n +2 examples/old_faithful.tsv | histr -o hist.msgpack
mean count
1.855946 56 ▇ ▇ ▇ ▇ ▇ ▇ ▇ ▇ ▇ ▇
2.162333 27 ▇ ▇ ▇ ▇ ▇
2.436364 11 ▇ ▇
2.912500 4 ▇
3.402125 8 ▇
3.674462 13 ▇ ▇
3.987889 36 ▇ ▇ ▇ ▇ ▇ ▇
4.297208 48 ▇ ▇ ▇ ▇ ▇ ▇ ▇ ▇ ▇
4.622364 55 ▇ ▇ ▇ ▇ ▇ ▇ ▇ ▇ ▇ ▇
4.919000 14 ▇ ▇ ▇
Instead of piping, the file could be passed directly as histr examples/old_faithful.tsv, but then a warning would be printed to standard error saying that parsing the first line (the column names) failed.
It can be used with other command line programs, for example, to estimate the histogram of response times from ping.
$ ping google.com -c 20 | sed -n 's/.*time=\([0-9.]*\).*/\1/p' | histr -b 5
mean count
8.965000 2 ▇ ▇
10.13000 10 ▇ ▇ ▇ ▇ ▇ ▇ ▇ ▇ ▇ ▇
11.20000 3 ▇ ▇ ▇
13.22500 4 ▇ ▇ ▇ ▇
18.00000 1 ▇
More details can be found in histr -h, and some usage examples can be executed using the Justfile in this repository with just examples.
Library
Histr is also available as a Rust crate. It supports creating histograms from data or building them on the fly in a streaming manner. The histograms can be resized and merged with other histograms. The crate exposes methods for calculating basic statistics (mean, standard deviation, median, quantiles) from the histograms, and for deriving empirical cumulative distribution functions and kernel density estimators from them.
use histr::StreamHist;
use histr::KernelDensity;
// initialize a histogram with 10 bins
let mut hist = StreamHist::with_capacity(10);
// add some values to it
hist.insert(1.13);
hist.insert(2.67);
// ...
// calculate statistics
println!("Mean = {}", hist.mean());
// convert it to a kernel density estimator
let kde = KernelDensity::from(hist.clone());
println!("f({}) = {}", 3.14, kde.density(3.14));
// print the histogram as a JSON
println!("{}", hist.to_json());
To use it, specify it in Cargo.toml as:
[dependencies]
histr = { git = "https://github.com/twolodzko/histr.git" }
Other implementations
Similar implementations are also available in carsonfarmer/streamhist (Python), maki-nage/distogram (Python), VividCortex/gohistogram (Go), aaw/histosketch (Go), bigmlcom/histogram (Java/Clojure), aaw/histk (C), malor/bhtt (Rust), jettify/streamhist (Rust), etc. They vary in maturity and features, and some do not implement the approach described by Yael Ben-Haim and Elad Tom-Tov (2010) or diverge from it.