Dynamically adding/removing workers

Question

Dynamically adding/removing workers

conradoverta opened this issue a year ago · comments

Hi, folks!

I apologize if this has been asked before, but I searched through the issues and only found two tangential ones (#182 and #319).

I'm curious about the possibility of adding and removing workers from a data flow cluster. More specifically, thinking about leveraging cloud where we might lose workers because nodes are replaced or we can add more to scale up. I've checked the documentation and it seems like the list of workers is static.

Is there any documentation about this? How would we manage to modify the cluster configuration?

Thanks!

Frank McSherry · Answer 1 · Sat Aug 19 2023 05:01:56 GMT+0800 (China Standard Time)

Hello! It's a great question.

I think the short answer is "no". It is challenging to add and remove workers, as much of the shared state they use to coordinate relies on static assumptions about e.g. their number. There was some work at ETHZ (@antiguru or @utaal: do either of you recall details?) on dynamic workers, I think virtualizing the idea of a worker, and allowing with some interruption the reconfiguration. However, "losing" workers seems very problematic, as they are potentially indistinguishable from a very busy worker who needs to be awaited.

What I might suggest instead is to think of the workers and their operators as async tasks that can themselves farm out work, potentially to an elastic set of cloud workers, leaving the operators with the role of confirming and locking in completed work. This gives you access to timely dataflow's progress tracking, while allowing you to step away from its "exactly this many workers" threadpool model. The workers could shuttle resulting data around, or could shuttle references to cloud storage (e.g. S3 blob names) that the cloud workers could then pick up.

conradoverta · Answer 2 · Sun Aug 20 2023 04:40:39 GMT+0800 (China Standard Time)

Thanks for the reply, Frank!

That's a very interesting idea! So basically have timely handle the core consistency and work scheduling based on the flow specified, while the bigger state and worker is farmed outside. This could make the timely clusters pretty thin, which makes failure (e.g. network partition) much less likely.

I imagine, as part of the dataflow, we could save the smaller timely state (e.g. S3 locations) pretty frequently and reload the state at boot time to continue work in case of failure. This could allow pretty ephemeral clusters that can be restored after "every" step.

Is there some place (timely docs or external) that you'd recommend to help me think about state saving and resume? I have a rough idea what that would look like, but I'm sure there are a lot of nuance that I haven't thought about yet.