A collection of algorithms for mining data streams, including frequent itemsets, quantiles, sampling, moving averages, set membership and cardinality.




Frequent Itemsets


  • CountSketch [1]
  • CountMinSketch [2]
  • LossyCounting [3]
  • Majority [4]
  • MisraGries [5]
  • SpaceSaving [6]
  • StickySampling [3]
  • RealCounting
  • SimpleTopKCounting
  • TimeDecayCountMinSketch
  • TimeDecayRealCounting
  • AMSSketch
  • CCFCSketch
  • CGT


Except for the CountMinSketchAlt class, all algorithms implement the IRichFrequency interface. Here's an example using the SpaceSaving algorithm:

Random r = new Random();
int counters = 20;
double support = 0.01;
double maxError = 0.1;

IRichFrequency<Integer> counter = new SpaceSaving<Integer>(counters, support, maxError);
for (int i = 0; i < 1000; i++) {
    counter.add(r.nextInt(100), 1);
}

// get the top 10 items
List<CountEntry<Integer>> topk = counter.peek(10);

// print the items
for (CountEntry<Integer> entry : topk) {
    System.out.println(entry);
}

// get the frequency of a single item
int item = 25;
long freq = counter.estimateCount(item);
System.out.println(item + ": " + freq);

Time Decaying Algorithms

TimeDecayRealCounting and TimeDecayCountMinSketch apply a decay function to their counts, so that recent values carry more weight while older values gradually fade away.

The decay function implements the DecayFormula interface. Currently there are three implementations: the exponential (ExpDecayFormula), the linear (LinDecayFormula), and the logarithmic (LogDecayFormula).

These counting algorithms implement a separate interface, ITimeDecayFrequency, because the methods for adding an item and for estimating its frequency both take an additional argument: the timestamp.
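To make the idea concrete, here is a minimal, self-contained sketch of an exact counter with exponential decay. It does not use the library's classes: the class name DecayedCounter, the lambda parameter, and the add/estimate signatures are all hypothetical, chosen only to show why both operations need a timestamp. The decay weight used here is exp(-lambda * age), one common form of exponential decay, and may differ from what ExpDecayFormula computes.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of a time-decayed counter; not the library's implementation. */
final class DecayedCounter<T> {
    /** Decayed count together with the time it was last updated. */
    private static final class Entry {
        double count;
        long lastSeen;
    }

    private final double lambda; // decay rate per time unit (assumed parameter)
    private final Map<T, Entry> entries = new HashMap<>();

    DecayedCounter(double lambda) {
        this.lambda = lambda;
    }

    /** Weight that remains after `age` time units: exp(-lambda * age). */
    private double decay(long age) {
        return Math.exp(-lambda * Math.max(0L, age));
    }

    /** Decays the stored count up to `timestamp`, then adds the increment. */
    void add(T item, double increment, long timestamp) {
        Entry e = entries.get(item);
        if (e == null) {
            e = new Entry();
            e.lastSeen = timestamp;
            entries.put(item, e);
        }
        e.count = e.count * decay(timestamp - e.lastSeen) + increment;
        // Never move the clock backwards for out-of-order items.
        e.lastSeen = Math.max(e.lastSeen, timestamp);
    }

    /** Returns the count as seen at `timestamp`; unseen items count as 0. */
    double estimate(T item, long timestamp) {
        Entry e = entries.get(item);
        return e == null ? 0.0 : e.count * decay(timestamp - e.lastSeen);
    }

    public static void main(String[] args) {
        // A half-life of 10 time units: weight halves every 10 units.
        DecayedCounter<String> counter = new DecayedCounter<>(Math.log(2) / 10.0);
        counter.add("x", 1.0, 0L);
        counter.add("x", 1.0, 10L);
        System.out.println(counter.estimate("x", 10L)); // approximately 1.5
        System.out.println(counter.estimate("x", 20L)); // approximately 0.75
    }
}
```

The same item yields different estimates at different query times, which is why a plain add(item, count) / estimateCount(item) pair is not enough for decayed counting.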


Top-K

  • StreamSummary [6]
  • ConcurrentStreamSummary
  • Frequent
  • StochasticTopper


Top-K algorithms are used in much the same way as the frequent itemset algorithms, except that they do not support the estimateCount method.

ITopK<String> counter = new StreamSummary<String>(3);

String[] stream = {"X", "X", "Y", "Z", "A", "B", "C", "X", "X", "A", "C", "A", "A"};
for (String i : stream) {
    counter.add(i);
}

List<CountEntry<String>> topk = counter.peek(3);
for (CountEntry<String> item : topk) {
    System.out.println(item);
}


Quantiles

  • CKMSQuantiles [7]
  • Frugal2U [8]
  • GKQuantiles [9]
  • MPQuantiles [10]
  • QDigest [11]
  • WindowSketchQuantiles [12]
  • RSSQuantiles [13]
  • EnsembleQuantiles
  • ExactQuantiles
  • ExactQuantilesAll
  • SimpleQuantiles
  • SumQuantiles
  • TDigest


double[] quantiles = new double[]{0.05, 0.25, 0.5, 0.75, 0.95};
IQuantiles<Integer> instance = new Frugal2U(quantiles, 0);

RandomEngine r = new MersenneTwister64(0);
Normal dist = new Normal(100, 50, r);
int numSamples = 1000;
for (int i = 0; i < numSamples; ++i) {
    int num = (int) Math.max(0, dist.nextDouble());
    // feed the sample to the quantile estimator
    instance.offer(num);
}

for (double q : quantiles) {
    System.out.println(q + ": " + instance.getQuantile(q));
}


Cardinality

  • AdaptiveCounting [14]
  • LogLog [15]
  • HyperLogLog [16]
  • HyperLogLogPlus [17]
  • LinearCounting [18]
  • CountThenEstimate
  • BJKST [26]
  • FlajoletMartin [27]
  • KMinCount


ICardinality card = new LogLog(8);

for (int i = 0; i < 100; i++) {
    card.offer(Math.random() * 100.0);
}

System.out.println("Cardinality: " + card.cardinality());


Average

  • MovingAverage
  • ExponentialMovingAverage
  • SimpleEWMA
  • VariableEWMA
  • TEWMA [25]


// create an EWMA with 15 seconds of age for the metrics in the period
IAverage avg = new VariableEWMA(15.0);

for (int i = 0; i < 100; i++) {
    avg.add(Math.random() * 100.0);
    if (i % 10 == 0)
        System.out.println("Average: " + avg.getAverage());
}


Membership

  • BloomFilter [22]
  • BloomFilterAlt (alternative implementation)
  • CountingBloomFilter [19]
  • VarCountingBloomFilter (with variable bucketsPerWord)
  • DynamicBloomFilter [20]
  • RetouchedBloomFilter [21]
  • StableBloomFilter [23]
  • TimingBloomFilter [24]
  • ODTDBloomFilter [28]


IFilter bloom = new BloomFilter(1000, 32, Hash.MURMUR_HASH);

for (int i = 0; i < 100; i++) {
    String val = UUID.randomUUID().toString();
    Key k = new Key(val.getBytes());
    System.out.println(val + " exists? " + bloom.membershipTest(k));
}


Sampling

  • BernoulliSampler
  • ChainSampler [29]
  • ReservoirSampler
  • SystematicSampler
  • WRSampler (With Replacement)
  • WeightedRandomSampler
  • L0Sampler [30]
  • SpaceSavingSampler
  • FrequentSampler


// Create a sampler with 30% probability
ISampler sampler = new BernoulliSampler(0.3);

Random rand = new Random();

// Create a dummy stream of ints
List<Integer> stream = new ArrayList<Integer>(1000);
for (int i = 0; i < 1000; i++)
    stream.add(rand.nextInt(100));

for (Integer tuple : stream) {
    if (sampler.next()) {
        // tuple was sampled, do something
    } else {
        // tuple was ignored, move on
    }
}


Classifiers

  • Perceptron
  • NaiveBayes
  • NaiveBayesWOP
  • BoundedBayes
  • LossyBayes
  • MultiBayes
  • MultiLossyBayes
  • MultiTopKBayes
  • SticySamplingBayes
  • TopKBayes
  • MajorityClass
  • RandomClassifier
  • MultiRandomClassifier
  • AROWClassifier (Adaptive Regularization of Weight Vectors) [32]
  • BWinnowClassifier (Balanced Winnow Classifier) [33]
  • PAClassifier, MultiClassPAClassifier [34]
  • WinnowClassifier


NaiveBayes nb = new NaiveBayes();
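ICsvListReader listReader = new CsvListReader(
        new FileReader("src/test/resources/golf.csv"),
        // assumption: use the CsvPreference that matches the file's delimiter
        CsvPreference.STANDARD_PREFERENCE);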


List<String> list;
// read the file row by row and train the classifier on each record
while ((list = listReader.read()) != null) {
    Data data = new DataImpl();
    data.put("outlook", list.get(0));
    data.put("temperature", Integer.parseInt(list.get(1)));
    data.put("humidity", Integer.parseInt(list.get(2)));
    data.put("wind", Boolean.parseBoolean(list.get(3)));
    data.put("play", list.get(4));

    nb.learn(data);
}


Data test = new DataImpl();
test.put("outlook", "sunny");
test.put("temperature", "cool");
test.put("humidity", "high");
test.put("wind", "TRUE");

String prediction = nb.predict(test);
System.out.println("Item is: " + test);
System.out.println("Prediction is: " + prediction);


Clustering

  • K-Means
  • BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) [31]


