knix-microfunctions / knix

Serverless computing platform with process-based lightweight function execution and container-based application isolation. Works in Knative and bare metal/VM environments.

Home Page: https://knix.io



Logging vs storing checkpoints

iakkus opened this issue

Currently, the function worker stores the following in the data layer as part of the "checkpointing" procedure after a function execution finishes:

  1. the result of the current function execution
  2. the input(s) to the next function(s)
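
For illustration, a checkpoint record of this kind might look like the following sketch. The data-layer client's `put` interface, the key layout, and the field names are assumptions made for the example, not knix's actual implementation.

```python
import json
import time


class CheckpointWriter:
    """Hypothetical sketch of the checkpointing step that runs after a
    function execution finishes; all interfaces here are assumptions."""

    def __init__(self, data_layer_client):
        # Assumed to offer a simple put(key, value) method.
        self._dl = data_layer_client

    def store_checkpoint(self, execution_id, function_name, result, next_inputs):
        record = {
            "execution_id": execution_id,
            "function": function_name,
            "timestamp": time.time(),
            # 1. the result of the current function execution
            "result": result,
            # 2. the input(s) to the next function(s) in the workflow
            "next_inputs": next_inputs,
        }
        key = "checkpoint_%s_%s" % (execution_id, function_name)
        self._dl.put(key, json.dumps(record))
        return key
```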

The purpose is to enable a "recovery manager" to retry failed executions, restarting from the latest checkpoint where possible. The recovery manager would also act as a progress tracker for workflow executions via progress messages sent to it during the checkpointing procedure. (The recovery manager has not been implemented yet.)
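
As a rough sketch of how such a recovery manager might consume progress messages and restart a failed execution from the latest checkpoint (all names and interfaces are hypothetical, since the component does not exist yet):

```python
class RecoveryManager:
    """Hypothetical sketch of the (unimplemented) recovery manager."""

    def __init__(self, publisher):
        self._latest = {}            # execution_id -> last progress message seen
        self._publisher = publisher  # assumed to offer publish(topic, payload)

    def on_progress(self, message):
        # Called for every progress message sent during checkpointing;
        # the message is expected to carry the checkpointed next inputs.
        self._latest[message["execution_id"]] = message

    def on_failure(self, execution_id):
        checkpoint = self._latest.get(execution_id)
        if checkpoint is None:
            # No checkpoint seen: the execution has to restart from the beginning.
            return False
        # Restart from the latest checkpoint by re-delivering the stored
        # input(s) to the next function(s).
        for next_function, payload in checkpoint["next_inputs"].items():
            self._publisher.publish(next_function, payload)
        return True
```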

We have support for catch and retry in the workflow description, so that a developer can specify which errors to look for, but that handling is developer-specific. The goal of the recovery manager would be to make certain infrastructure failures transparent to developers.
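
For context, a Retry/Catch specification in an Amazon-States-Language-style workflow description looks roughly like the sketch below (written here as a Python dict; the state and error names are made up, and the exact fields knix supports may differ):

```python
# Developer-specified error handling for a single task state.
order_state = {
    "Type": "Task",
    "Resource": "process_order",
    "Retry": [
        {
            "ErrorEquals": ["TransientError"],  # errors the developer anticipates
            "IntervalSeconds": 2,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }
    ],
    "Catch": [
        {
            "ErrorEquals": ["States.ALL"],      # catch-all fallback
            "Next": "HandleFailure",
        }
    ],
    "Next": "ShipOrder",
}
```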

A while ago, we had an offline discussion about putting this checkpoint information into the "log" instead of the data layer. The main advantages would be 1) less frequent data layer accesses, which can be slow and create unnecessary load, and 2) lower function interaction latency. The disadvantage would be that the recovery manager would have to sift through a workflow's log to find the appropriate checkpoints.
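
A minimal sketch of the "logging" alternative, assuming a hypothetical append-only log per workflow; the second function shows the extra scanning work this pushes onto the recovery manager:

```python
import json


def append_checkpoint(workflow_log, record):
    """Append a checkpoint record to the workflow's log instead of writing it
    to the data layer (hypothetical log object with an append() method)."""
    workflow_log.append(json.dumps(record))


def latest_checkpoint(workflow_log_entries, execution_id):
    """Sift through the workflow's log entries (oldest first) to find the
    newest checkpoint of a given execution."""
    for entry in reversed(workflow_log_entries):
        record = json.loads(entry)
        if record.get("execution_id") == execution_id:
            return record
    return None
```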

Should we continue this approach and build the recovery manager? If so, perhaps we should reconsider the "logging" approach.

The RM sounds like an interesting approach to gain more insight into workflow execution status and progress monitoring. Wouldn't this recovery manager also offer new opportunities to investigate and visualise workflow progress and errors in a more concise way than our current approach?
BTW, what is meant by "transparency of infrastructure failures"? Is this tool oriented toward the developer's need to debug their workflows, or would it rather be an infrastructure management tool (infrastructure status/health check)?

It was envisioned as an infrastructure management tool. Some failures may be due to the underlying infrastructure (e.g., crashed nodes) and cannot be anticipated by the developers (and probably cannot be expressed in their catch and retry descriptions).

I don't think it would provide additional information about the actual progress of the workflow execution. If I recall correctly, the current approach was conceived as an alternative to the progress tracker used for visualization. In fact, the recovery manager would receive exactly the same information as the progress log we use for visualization.

The reason the recovery manager would also act as a progress tracker is that it needs to know the workflow's progress in order to take the correct recovery actions, not because it would manage the progress of the workflow execution in the failure-free case.

The data layer would still receive the result of the entire workflow (i.e., when publishing to the exit topic), such that the sandbox frontend can find it in the data layer if needed, especially for 'async' execution. However, that's just for the last function in a workflow.
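
For illustration, the frontend-side lookup for an 'async' execution could be a simple polling loop like the sketch below; the client interface and key layout are assumptions, not knix's actual API:

```python
import time


def wait_for_workflow_result(data_layer_client, execution_id,
                             timeout=60.0, poll_interval=0.5):
    """Hypothetical sketch: poll the data layer for the final workflow result,
    which is written when the last function publishes to the exit topic."""
    deadline = time.time() + timeout
    key = "result_%s" % execution_id  # illustrative key layout
    while time.time() < deadline:
        value = data_layer_client.get(key)  # assumed get(key) -> value or None
        if value is not None:
            return value
        time.sleep(poll_interval)
    raise TimeoutError("no result for execution %s within %.0f s"
                       % (execution_id, timeout))
```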

The recovery manager implementation should be a separate issue. Changed the title accordingly.

@paarijaat @manuelstein @abeckn: Any other comments/thoughts?