MicrosoftResearch / Naiad

The Naiad system provides fast incremental and iterative computation for data-parallel workloads

Home Page: http://microsoftresearch.github.io/Naiad/

window based operators

TonyAbell opened this issue · comments

I am looking to do tumbling windows similar to what StreamInsight does.

http://blogs.msdn.com/b/streaminsight/archive/2010/11/23/windows-part-1-the-basics.aspx

I do not currently see such an operator in the Naiad source.

Is it planned?

If not, can you give guidance on how best to build it?

Hi Tony,

It should certainly be possible to build a tumbling window operator in Naiad, and there are many possible approaches, depending on the framework you use to write the rest of your program, and the window policy that you want to enact.

The basis of a tumbling window is a (custom) vertex that rewrites the timestamps of incoming messages (records) so that members of the same window have the same timestamp. Depending on the policy required, this vertex might gather all of the incoming messages in a single location to compute some predicate over them; it might be a binary vertex with a control input that determines when a new window should start; or it might even be part of a data-parallel stage wrapped in a feedback loop, so that parallel vertices could exchange data such as counts.
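To make the timestamp-rewriting idea concrete, here is a minimal sketch in plain Python. It is not Naiad code and does not use the Naiad vertex API; the names `assign_epoch` and `tumble` are hypothetical, and the policy assumed is the simplest one, fixed-length windows keyed on event time.

```python
def assign_epoch(event_time_seconds, window_seconds):
    """Map an event time to the index of the tumbling window containing it."""
    return int(event_time_seconds // window_seconds)


def tumble(records, window_seconds):
    """Rewrite (event_time, payload) pairs as (epoch, payload) pairs.

    Every record that falls in the same fixed-length window ends up
    with the same epoch number, which is the property described above.
    """
    return [(assign_epoch(t, window_seconds), payload) for t, payload in records]


# Hypothetical input: event times in seconds, one-minute windows.
records = [(0.5, "a"), (59.9, "b"), (60.0, "c"), (125.0, "d")]
stamped = tumble(records, 60)
# "a" and "b" share epoch 0, "c" lands in epoch 1, "d" in epoch 2.
```

A control-input or feedback-loop policy would replace `assign_epoch` with logic that decides window boundaries from other data, but the output shape (records tagged with a window identifier) would stay the same.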

Once you produce a stream where the epoch number identifies each window, the existing frameworks can be used to process it. To use differential dataflow, you can Select each record of type R to make it a Weighted<R> with weight 1, and pass this through a SlidingWindow(1) operator (which ensures that the stateful differential dataflow operators consider each window in isolation). You can also use the non-incremental Lindi framework straight out of the box, because it applies the operator logic independently on records with different timestamps.
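As a rough illustration of what "each window in isolation" means, the following plain-Python sketch groups records by epoch and computes a top-N count per epoch independently. It is only an analogy for the behaviour described above, not the Lindi or differential dataflow API; `top_n_per_epoch` and the sample data are made up for the example.

```python
from collections import Counter, defaultdict


def top_n_per_epoch(stamped, n):
    """Count keys separately for each epoch and keep the n most frequent.

    No state is shared between epochs, so each window is processed
    as if the others did not exist.
    """
    groups = defaultdict(Counter)
    for epoch, key in stamped:
        groups[epoch][key] += 1
    return {epoch: counts.most_common(n) for epoch, counts in sorted(groups.items())}


# Hypothetical (epoch, key) records, as produced by a windowing step.
stamped = [(0, "#naiad"), (0, "#naiad"), (0, "@tony"), (1, "@tony")]
result = top_n_per_epoch(stamped, 1)
```

Here epoch 0 reports `#naiad` with a count of 2, and epoch 1 reports `@tony` with a count of 1, even though `@tony` also appeared in epoch 0.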

Which of these cases best matches your problem?

I am looking to compute rolling aggregate counts.

Using a Twitter stream as an example, I would like to get the top N (or all) '@mention' or '#hash' entries for a given time period: minute, hour, day, or week.

Is there documentation on the differences between the Lindi and differential dataflow frameworks, and on their characteristics?

From your explanation, I think differential dataflow would be most appropriate.

From what I understood of your comment, I would need to derive a custom vertex from here:

https://github.com/MicrosoftResearchSVC/Naiad/blob/release_0.2/Naiad/Dataflow/Vertex.cs

Would I create an internal timer that resets the windows and rewrites the epochs accordingly?

And when using the SlidingWindow(1) operator, would it only output results for the given epoch?