twitter / algebird

Abstract Algebra for Scala

Home Page:https://twitter.github.io/algebird

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

A monoid for sliding time window aggregations

dasch opened this issue · comments

I've started using Algebird in conjunction with Google Dataflow, which works great! My main data structure is a deeply nested case class that is implicitly a monoid – it contains a bunch of stuff, mainly time-sorted Max values (such that I only keep the most recent version of some value).

My needs are a bit specific, though, and the built-in time window aggregation semantics are causing some complications. I'd like to ask if you have any experience implementing windowed aggregations using monoids, or whether you'd be interested in accepting such an implementation were I to write one.

My current idea is loosely based on TopK – let's just call it SlidingWindow. It keeps k "slots", each containing an instance of another monoid and a timestamp (the timestamp will be normalized according to the period of the time window, e.g. start-of-day). When adding two of these together, each slot that maps to the same timestamp will be added together, but only the k most recent slots will be retained. There would be a method that returned the "sum" of all the slots, representing, say, the total count of some event during the last k days/minutes/whatever.

Does this design make sense? I've tried looking, but I have seen little mention of time as a factor here, although it's critical in many streaming applications.

Sorry for the bother if this is the wrong place to ask.

@johnynek mentions something related here "scale.bythebay.io: Oscar Boykin, How to Elm-ify Your ML" https://youtu.be/k1jhDcruiLE?t=2142 but would like to see some code. :-)

We've got a pretty nice approximate sliding window implementation here: http://twitter.github.io/algebird/datatypes/approx/exponential_histogram.html

This might help!