TimelyDataflow / timely-dataflow

A modular implementation of timely dataflow in Rust

How to merge data of workers after execution?

zy1994-lab opened this issue · comments

I am learning Timely now. I have run it in a distributed environment. After each worker has received and processed its data, I want to merge the data across the workers on each machine at the end of dataflow execution. Does anyone know how to do this?

You can use the exchange operator (e.g. as described here) to send all outputs to a specific worker.


Thanks for your answer @comnik . I am considering using the exchange operator. But since I am processing the dataflow in batches, I want to cache the results in a vec and exchange them to the target worker after the last round of the dataflow. I couldn't find a way to identify the last round of the dataflow. Is it possible to do this in Timely?

What is arguably more idiomatic is to exchange the data as they arrive (it is a streaming system!) and then to use the probe operator to inform you when the stream is complete. The receiving worker would then be able to consolidate the results it receives into a vector, or whichever other data structure would help.

Another possibly helpful example is how differential dataflow uses the Arranged structure and its trace (https://youtu.be/83rG471bmw8?t=1730). If you are using differential dataflow, this will allow you to exfiltrate the data from the dataflow and then interactively navigate the results at any point (including at the end of the computation).

I'm guessing it might be helpful for me to write (as a demo, possibly as a library method) a sink operator that perhaps takes a FnOnce(Vec<Data>) which gets called at the end of the computation at each worker, with whichever data ended up there. This probably leaves a bunch of performance on the floor (e.g. the chance to process data as it arrives), but it would at least give an example of how to write this.

Hi Frank, a sink operator will definitely help me (and possibly many other users) a lot. I totally agree with your point that Timely is a streaming system. But due to its high performance, I am using it for some analytics tasks that used to be done with frameworks like Spark. I have encountered multiple examples in analytics tasks where you need to perform some operation at the end of the computation at each worker. Currently I put such operations in the Drop impl of some pre-defined structures, which are moved into the dataflow, so that they execute at the end. However, this doesn't always fit my requirements and is very hard to write/read. So adding a sink operator would be very good in my opinion!