knix-microfunctions / knix

Serverless computing platform with process-based lightweight function execution and container-based application isolation. Works in Knative and bare metal/VM environments.

Home Page: https://knix.io



Logging vs storing checkpoints

iakkus opened this issue

Currently, the function worker stores the following in the data layer as part of the "checkpointing" procedure after a function execution finishes:

  1. the result of the current function execution
  2. the input(s) to the next function(s)
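
For illustration, a checkpoint record of this kind might look like the following sketch. The data-layer client's `put` interface, the key layout, and the field names are assumptions made for the example, not knix's actual implementation.

```python
import json
import time


class CheckpointWriter:
    """Hypothetical sketch of the checkpointing step that runs after a
    function execution finishes; all interfaces here are assumptions."""

    def __init__(self, data_layer_client):
        # Assumed to offer a simple put(key, value) method.
        self._dl = data_layer_client

    def store_checkpoint(self, execution_id, function_name, result, next_inputs):
        record = {
            "execution_id": execution_id,
            "function": function_name,
            "timestamp": time.time(),
            # 1. the result of the current function execution
            "result": result,
            # 2. the input(s) to the next function(s) in the workflow
            "next_inputs": next_inputs,
        }
        key = "checkpoint_%s_%s" % (execution_id, function_name)
        self._dl.put(key, json.dumps(record))
        return key
```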

The purpose is to enable a "recovery manager" to retry failed executions, restarting from the latest checkpoint where possible. The recovery manager would also act as a progress tracker for workflow executions via progress messages sent to it during the checkpointing procedure. (The recovery manager has not been implemented yet.)
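
As a rough sketch of how such a recovery manager might consume progress messages and restart a failed execution from the latest checkpoint (all names and interfaces are hypothetical, since the component does not exist yet):

```python
class RecoveryManager:
    """Hypothetical sketch of the (unimplemented) recovery manager."""

    def __init__(self, publisher):
        self._latest = {}            # execution_id -> last progress message seen
        self._publisher = publisher  # assumed to offer publish(topic, payload)

    def on_progress(self, message):
        # Called for every progress message sent during checkpointing;
        # the message is expected to carry the checkpointed next inputs.
        self._latest[message["execution_id"]] = message

    def on_failure(self, execution_id):
        checkpoint = self._latest.get(execution_id)
        if checkpoint is None:
            # No checkpoint seen: the execution has to restart from the beginning.
            return False
        # Restart from the latest checkpoint by re-delivering the stored
        # input(s) to the next function(s).
        for next_function, payload in checkpoint["next_inputs"].items():
            self._publisher.publish(next_function, payload)
        return True
```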

We have support for catch and retry in the workflow description, so that a developer can specify which errors to look for, but that handling is developer-specific. The goal of the recovery manager would be to make certain infrastructure failures transparent to developers.
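
For context, a Retry/Catch specification in an Amazon-States-Language-style workflow description looks roughly like the sketch below (written here as a Python dict; the state and error names are made up, and the exact fields knix supports may differ):

```python
# Developer-specified error handling for a single task state.
order_state = {
    "Type": "Task",
    "Resource": "process_order",
    "Retry": [
        {
            "ErrorEquals": ["TransientError"],  # errors the developer anticipates
            "IntervalSeconds": 2,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }
    ],
    "Catch": [
        {
            "ErrorEquals": ["States.ALL"],      # catch-all fallback
            "Next": "HandleFailure",
        }
    ],
    "Next": "ShipOrder",
}
```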

A while ago, we had an offline discussion about putting this checkpoint information into the "log" instead of the data layer. The main advantages would be 1) less frequent data layer accesses, which can be slow and create unnecessary load, and 2) lower function interaction latency. The disadvantage would be that the recovery manager would have to sift through a workflow's log to find the appropriate checkpoints.
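
A minimal sketch of the "logging" alternative, assuming a hypothetical append-only log per workflow; the second function shows the extra scanning work this pushes onto the recovery manager:

```python
import json


def append_checkpoint(workflow_log, record):
    """Append a checkpoint record to the workflow's log instead of writing it
    to the data layer (hypothetical log object with an append() method)."""
    workflow_log.append(json.dumps(record))


def latest_checkpoint(workflow_log_entries, execution_id):
    """Sift through the workflow's log entries (oldest first) to find the
    newest checkpoint of a given execution."""
    for entry in reversed(workflow_log_entries):
        record = json.loads(entry)
        if record.get("execution_id") == execution_id:
            return record
    return None
```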

Should we continue this approach and build the recovery manager? If so, perhaps we should reconsider the "logging" approach.

The RM sounds like an interesting approach to gain more insight into workflow execution status and progress monitoring. Wouldn't this recovery manager also offer new opportunities to investigate and visualise workflow progress and errors in a more concise way than our current approach?
BTW, what is meant by "transparency of infrastructure failures"? Is this tool oriented toward the developer's need to debug their workflows, or would it rather be an infrastructure management tool (infrastructure status/health check)?

It was envisioned as an infrastructure management tool. Some failures may be due to the underlying infrastructure (e.g., crashed nodes) and cannot be anticipated by the developers (and probably cannot be expressed in their catch and retry descriptions).

I don't think it would provide additional information about the actual progress of the workflow execution. If I recall correctly, the current approach was conceived as an alternative to the progress tracker used for visualization. In fact, the recovery manager would receive exactly the same information as the progress log we use for visualization.

The reason the recovery manager would also act as a progress tracker is that it needs to know the workflow's progress in order to take the correct recovery actions, not because it would manage the progress of the workflow execution in the failure-free case.

The data layer would still receive the result of the entire workflow (i.e., when publishing to the exit topic), such that the sandbox frontend can find it in the data layer if needed, especially for 'async' execution. However, that's just for the last function in a workflow.
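
For illustration, the frontend-side lookup for an 'async' execution could be a simple polling loop like the sketch below; the client interface and key layout are assumptions, not knix's actual API:

```python
import time


def wait_for_workflow_result(data_layer_client, execution_id,
                             timeout=60.0, poll_interval=0.5):
    """Hypothetical sketch: poll the data layer for the final workflow result,
    which is written when the last function publishes to the exit topic."""
    deadline = time.time() + timeout
    key = "result_%s" % execution_id  # illustrative key layout
    while time.time() < deadline:
        value = data_layer_client.get(key)  # assumed get(key) -> value or None
        if value is not None:
            return value
        time.sleep(poll_interval)
    raise TimeoutError("no result for execution %s within %.0f s"
                       % (execution_id, timeout))
```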

The recovery manager implementation should be a separate issue. Changed the title accordingly.

@paarijaat @manuelstein @abeckn: Any other comments/thoughts?