petergaultney / One-Pass-Parallel-Descriptive-Statistics

C++ and Python implementations of the parallel version of Welford's online algorithm for descriptive stats

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Contained are a tested/working Python and C++ implementation of the parallel version of Welford's algorithm for calculating simple descriptive statistics in a single pass over the data. https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm

The advantage of this approach is twofold: Not only can you calculate these statistics as your data points arrive, for, e.g., a live dashboard, but you can actually distribute your calculation across multiple 'groups', whether phyiscal hardware (servers) or time spans (e.g. in 5 minute 'windows'), and merge those groups at arbitrary times to get the overall statistics for your full population.

In the long run, this provides the advantage of allowing reduced data storage, as you can aggregate and store the data in whatever granularity you need, allowing you to trade off storage, compute, and precision within a large parameter space.

About

C++ and Python implementations of the parallel version of Welford's online algorithm for descriptive stats

License:Apache License 2.0


Languages

Language:Python 75.0%Language:C++ 25.0%