GrafeasGroup / tor

Community curation bot for /r/TranscribersOfReddit

Home Page: https://reddit.com/r/transcribersofreddit


RFC: Improve robustness through worker queues

TimJentzsch opened this issue

Robustness is one of the most important properties for the bot. The Reddit API often has hiccups, and we need to ensure that the bot still works correctly. Unfortunately, we sometimes see the bot crash when the Reddit API is having major problems. We also see post duplication and other unexpected behavior when some API calls fail and a whole sequence of operations is repeated, even though its first operations had already succeeded.

I propose a complete rework of how we interact with the API in order to improve robustness.

System Overview

Worker Queue

The first element of the system is a worker queue. Instead of executing tasks directly, we add them to the queue and then run a worker cycle. The cycle tries to execute every queued task one by one. If a task fails, the worker either a) retries the operation directly (up to a maximum number of times), b) enqueues it for the next cycle (up to an optional maximum number of times), or c) discards it.

This gives us a very powerful system for designing robust workflows. For example, if the bot fails to submit a post to the queue, it could retry once and otherwise re-enqueue the task for the next cycle. That way we don't block the other waiting tasks indefinitely, while still not losing a post from the queue.
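As a sketch, such a cycle could look like the following; the class and parameter names are hypothetical, not existing tor code:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Task:
    """A unit of work with retry and requeue budgets (illustrative names)."""
    action: Callable[[], None]
    max_retries: int = 1   # extra direct attempts within one cycle
    max_requeues: int = 3  # how often it may be pushed to the next cycle
    requeues: int = 0

class WorkerQueue:
    def __init__(self) -> None:
        self.tasks: List[Task] = []

    def add(self, task: Task) -> None:
        self.tasks.append(task)

    def run_cycle(self) -> None:
        """Execute every queued task once; a failing task is retried
        directly, re-enqueued for the next cycle, or discarded."""
        pending, self.tasks = self.tasks, []
        for task in pending:
            for _attempt in range(task.max_retries + 1):
                try:
                    task.action()
                    break  # success: task is done
                except Exception:
                    continue  # a) retry directly
            else:
                if task.requeues < task.max_requeues:
                    task.requeues += 1
                    self.tasks.append(task)  # b) enqueue for next cycle
                # c) otherwise the task is discarded
```

A flaky API call wrapped like this would survive a few cycles before anything is lost, instead of crashing the bot or blocking everything behind it.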

Task Sequences

It's often the case that we have a sequence of tasks that depend on each other and need to be executed in a fixed order. If a task finishes successfully, we move on to the next task in the sequence. If it fails, we can either retry that task directly or enqueue all of the remaining tasks for the next cycle. They depend on the previous tasks, so if we skip one of them, all remaining tasks need to be skipped as well.

A class that handles these task sequences and adds them to the worker will be central to many workflows. For example, we first want to submit a post to the ToR queue, then post a comment with the guidelines on the submission, and then submit the post to Blossom. Of course, the post must have been submitted successfully before we can comment on it. Thus, this needs to be a task sequence instead of individual tasks.
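A minimal sketch of such a sequence class, assuming each step is a plain callable (all names are illustrative):

```python
from typing import Callable, List, Optional

class TaskSequence:
    """A fixed-order chain of dependent steps.

    run() executes the steps in order and returns the steps that still
    need to run; a non-empty result can be re-enqueued as a whole for
    the next worker cycle.
    """

    def __init__(self, steps: List[Callable[[], None]]) -> None:
        self.steps = list(steps)

    def run(self) -> Optional["TaskSequence"]:
        while self.steps:
            try:
                self.steps[0]()
            except Exception:
                # Later steps depend on this one, so the whole remainder
                # must be deferred together, never skipped piecemeal.
                return TaskSequence(self.steps)
            self.steps.pop(0)
        return None
```

The submit/comment/Blossom flow above would then be something like `TaskSequence([submit_to_queue, post_guidelines, submit_to_blossom])`: if the comment step fails, the comment and the Blossom submission are deferred together.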

Error Logs

A key aspect of this system will be to respond to high API error rates. For example, if we get a lot of errors when posting Reddit comments, we should suspend all tasks requiring comments for a given amount of time. This will reduce erroneous posts in the queue and avoid stressing the Reddit API when it's already having problems, simplifying recovery. Additionally, detecting high error rates allows us to report them to mod Slack, enabling us to notify the volunteers that the bot is having issues.

The error response could be structured in multiple severity levels. If a high error rate is detected, level 1 is enabled and the operation is suspended for e.g. 5 minutes. If the error rate is still high after the cooldown, we increase the level to 2 and suspend operations for e.g. 10 minutes, and so on. Once the error rate is low again, we can gradually decrease the level.

Of course, this requires us to track every call to these APIs, both errors and successes. It also requires us to define which tasks (and sequences) depend on which APIs, so that we can suspend them appropriately.
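One way to sketch this tracking, with escalating severity levels; the threshold, window size, and cooldown values here are made up for illustration:

```python
import time
from typing import List

class ApiHealth:
    """Tracks recent call outcomes for one API and escalates a
    suspension level when the error rate is high (illustrative sketch)."""

    def __init__(self, threshold: float = 0.5, window: int = 20,
                 base_cooldown: float = 300.0) -> None:
        self.threshold = threshold        # error rate that triggers escalation
        self.window = window              # number of recent calls to consider
        self.base_cooldown = base_cooldown  # level 1 = 5 min, level 2 = 10 min, ...
        self.results: List[bool] = []     # True = success, False = error
        self.level = 0                    # 0 = healthy
        self.suspended_until = 0.0

    def record(self, success: bool) -> None:
        self.results.append(success)
        self.results = self.results[-self.window:]
        if len(self.results) < self.window:
            return
        errors = self.results.count(False)
        if errors / self.window >= self.threshold:
            # Escalate and suspend for progressively longer cooldowns.
            self.level += 1
            self.suspended_until = time.time() + self.level * self.base_cooldown
            self.results.clear()
        elif self.level > 0 and errors == 0:
            self.level -= 1  # recover gradually, one level at a time

    def available(self) -> bool:
        """Tasks depending on this API should only run when this is True."""
        return time.time() >= self.suspended_until
```

The worker would consult `available()` before running any task (sequence) that depends on that API, and a level change is also a natural point to send a notification to mod Slack.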

Advantages

This approach has three major advantages: 1) we practically eliminate the chance of the bot crashing because of API errors, 2) we reduce the chance of unintended behavior (e.g. duplicated posts in the queue) caused by API errors, and 3) we can track and respond to API failures and bot downtime more easily and quickly.

Challenges

This would require rather big changes to the codebase, and designing the system will not be an easy task. The two main problems I see are 1) making this approach testable and 2) allowing data to flow between the tasks of a sequence. For 1), we will probably need to rely heavily on dependency injection. For 2), consider that when we submit a post and then want to comment on it, the comment needs the ID of the post. So the first task somehow needs to be able to pass data to the second task in an ergonomic way.

Notably, if we design a useful system, we can also extract that functionality into a new repository and reuse it in the other bots.

A solution for the task sequences and the data flow could be that tasks return a list of tasks after execution, which are then queued immediately after the current task. The current task can then inject the data its dependent tasks need, and the task executor has a relatively simple model to implement (I think).
This would also make it easier to design tasks that spawn a dynamic number of other tasks which are not known beforehand.
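This follow-up-task model could be sketched like this, with a hypothetical submit/comment flow standing in for real Reddit API calls:

```python
import functools
from typing import Callable, List

# A task is a callable that returns the follow-up tasks to queue next.
Task = Callable[[], List["Task"]]

def run_all(queue: List[Task]) -> None:
    """Drain the queue; follow-up tasks run right after their parent."""
    while queue:
        task = queue.pop(0)
        followups = task()
        # Inject the follow-ups ahead of older work, immediately
        # after the task that spawned them.
        queue[:0] = followups

def submit_post(title: str, results: list) -> List[Task]:
    post_id = f"t3_{abs(hash(title)) % 1000}"  # stand-in for a Reddit call
    results.append(("submitted", post_id))
    # The parent binds the data (post_id) its dependent task needs.
    return [functools.partial(post_comment, post_id, results)]

def post_comment(post_id: str, results: list) -> List[Task]:
    results.append(("commented", post_id))
    return []  # no further follow-ups
```

Binding the post ID via `functools.partial` is one way to make the data flow ergonomic: the dependent task never has to look the ID up itself, and a task can just as easily return several follow-ups when their number is only known at runtime.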