support for checkpoint-based updates

Question

support for checkpoint-based updates

davidbirdsong opened this issue 4 years ago · comments

Our Flink job deploys rely heavily on checkpoints since our savepoints take around 30 -45 minutes to write and read back in on the new job.

It appears that enabling of savepointDisabled gets us part of the way there and that there exists mechanisms for relying on checkpoints to recover a failing job.

We set ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION from within our job and we'd really like to be able to rely only on checkpoints to update jobs.

The way I envision still supporting checkpoints, say for when we need to change parallelism, would be to submit a new job with savepointDisabled disabled such that the next job update would use savepoints.

I'm happy to work on this if a PR would be accepted.

Lakshmi · Answer 1 · Thu Apr 30 2020 06:23:35 GMT+0800 (China Standard Time)

@davidbirdsong Thanks for posting this idea!

I think this would be a cool feature. We support some part of this behavior in the Recovering state, except that in the current state machine, the Recovering state is only ever reached when the savepoint fails. We could re-use some of this behavior to explicitly support a checkpoint based update.

cc/ @mwylde