cookpad / barbeque

Job queue system to run job with Docker


Split Runner to Runner and Poller

eagletmt opened this issue

The current barbeque-worker runs the specified command as follows (see the sketch after the list):

  1. Set the execution status to running
  2. Spawn docker run cmd... or hako oneshot cmd...
  3. Wait for the process spawned in step 2
  4. Set the execution status from the exit code of step 2
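
A rough sketch of that flow, with illustrative method names rather than the actual barbeque code:

def run(job_execution, command)
  job_execution.update!(status: :running)
  pid = spawn('docker', 'run', *command)      # or `hako oneshot cmd...`
  _, process_status = Process.waitpid2(pid)   # block until the spawned process exits
  job_execution.update!(
    status: process_status.success? ? :success : :failed,
    finished_at: Time.now,
  )
end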

Problem

It's simple enough, but there are some problems.

  • Vulnerable to server failures and deployments
    • Especially when hako is used, if barbeque-worker goes down unexpectedly while running hako, the enqueued job keeps running on another host (ECS container instance), but barbeque-worker loses track of the running job completely and cannot recover it.
    • When barbeque-worker is running on ECS, due to an ECS limitation, the worker process gets SIGKILL after 30s by default when deploying (UpdateService API). We cannot deploy barbeque-worker while a long-running job is being executed.
  • Less scalable
    • The maximum number of concurrent job executions is limited to the number of barbeque-worker processes. Especially when hako is used, barbeque-worker doesn't consume many server resources because hako executes the job on another host (ECS container instance), so we could execute more jobs if the ECS cluster has enough capacity.

Solution

Split Barbeque::Runner into two parts: Runner and Poller.

  1. (Runner) Start the execution with docker run --detach cmd... or hako oneshot --no-wait cmd...
  2. (Runner) Set the execution status to running
  3. (Poller) Check the status of the execution periodically
  4. (Poller) When the execution finishes, set the execution status, stdout, and stderr.

I call the combination of Runner and Poller an Executor. An Executor can be customized just like the current Runner.
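
For illustration, a hypothetical Executor for the Hako case might be shaped like this; the method and helper names are assumptions, not the final API:

class Barbeque::Executor::Hako
  # Runner part: start the command without waiting and remember its identifier.
  def start_execution(job_execution, command)
    task_id = start_hako_oneshot_without_wait(command)  # `hako oneshot --no-wait cmd...`
    job_execution.update!(status: :running)
    store_task_identifier(job_execution, task_id)
  end

  # Poller part: called periodically; finalizes the execution once the task stops.
  def poll_execution(job_execution)
    task = fetch_task_status(find_task_identifier_from_db(job_execution))
    return if task.nil? || !task.stopped?
    job_execution.update!(status: task.success? ? :success : :failed, finished_at: task.stopped_at)
  end
end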

Pros

  • Tolerant to server failures and deployments
    • The executed command is managed by Docker or ECS, and its identifier is stored in the DB. They're more fault-tolerant than Barbeque :trollface:
    • We might be able to recover from server failures using the stored identifiers.
  • More scalable
    • Now we can spawn more jobs concurrently than the number of barbeque-worker processes.

Cons

  • Unable to control the maximum number of concurrent job executions
    • Should the Runner enforce a maximum number of running executions?

cc: @cookpad/dev-infra @k0kubun

I haven't seriously considered it yet (I'll take a look later), but it sounds like a better architecture at a glance 👍

For Hako, the Poller checks the status via the S3 task notification introduced in v1.3.0.
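
A minimal sketch of that check; the bucket name and key layout below are assumptions for illustration, not hako's actual notification format:

require 'aws-sdk-s3'
require 'json'

def fetch_task_status(task_arn)
  s3 = Aws::S3::Client.new
  object = s3.get_object(bucket: 'hako-task-notifications', key: "task_statuses/#{task_arn}.json")
  JSON.parse(object.body.read)    # includes the exit code and stopped time once the task has finished
rescue Aws::S3::Errors::NoSuchKey
  nil                             # no notification yet, i.e. the task is still running
end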

I want to know the details of the Poller implementation direction. How does it decide which task to poll? As far as I understand, the process that spawns a task and the one that polls it are different. In other words, which task to poll was obvious when they were not distributed (a process should poll the task it spawned), but it becomes arbitrary once distributed.

Will it poll the task whose created_at is earliest and which is not being polled by another poller? If so, will it get stuck when the polled task takes a long time? Or will it randomly select a task to poll every time?

With my current understanding (one poller process can't poll multiple tasks), it has two problems:

  • Which task to poll is not straightforward
  • There may be latency between when a task actually finishes and when it is marked as finished by barbeque
    • As far as I can see from this line, there may be up to 1 second of latency.
    • In that case, the running-jobs metric will be incorrect and harmful for monitoring, and it will make it hard to correctly introduce auto-scaling (on the barbeque-worker side, not seriously considered yet) based on that metric.

Then, what do you think about receiving the S3 task notification event via SQS? It would be cost-efficient for polling if all tasks take a long time, and it would solve the "which task to poll" problem.

I'm sorry, but I haven't considered the Docker runner counterpart 🙃

How does it decide which task to poll?

Randomly selected from all running job executions. One polling step for each execution is just docker inspect (Docker case) or s3:GetObject (Hako case), so it doesn't take much time.

There may be latency between when a task actually finishes and when it is marked as finished by barbeque

Right, but the actual finish time of the task can be obtained from docker inspect (Docker case) or the task status JSON (Hako case).
The update of the status column in the job_executions table has some delay, but the finished_at column is set correctly.
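
For the Docker case, one polling step could look like the following sketch; docker inspect returns a JSON array whose State object carries Running, ExitCode, and FinishedAt (the method name is illustrative):

require 'json'
require 'time'

def docker_task_result(container_id)
  state = JSON.parse(`docker inspect #{container_id}`).first.fetch('State')
  return nil if state.fetch('Running')    # still running; poll again later
  {
    success: state.fetch('ExitCode').zero?,
    stopped_at: Time.parse(state.fetch('FinishedAt')),    # actual finish time, not the polling time
  }
end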

The poller process acts as follows.

# Sketch of the poller loop: visit running executions in random order, one check per pass
loop do
  Barbeque::JobExecution.running.shuffle.each do |job_execution|
    task_identifier = find_task_identifier_from_db(job_execution)   # container ID or ECS task identifier stored by the Runner
    if ecs_task_stopped?(task_identifier)
      task = get_task_result(task_identifier)
      job_execution.update!(status: task.success? ? :success : :failed, finished_at: task.stopped_at)
    end
  end
  sleep(interval)
end

Moreover, hako oneshot itself polls ECS or S3.
The previous Barbeque::Runner::Hako implementation set the finished_at column using the time observed by the hako oneshot process, so the finished_at column also had such latency.
On the other hand, Barbeque::Executor::Hako can set the finished_at column correctly because it accesses the backend service (Docker or ECS) directly.
In other words, the polling is moved from hako oneshot into Barbeque::Executor::Hako's poller.

Then, what do you think about receiving the S3 task notification event via SQS?

Yes, that's exactly what I'd like to implement as the next big step. It should make polling much more efficient.

loop do
  message = sqs_client.receive_message       # receive an S3 task notification event delivered via SQS
  task = extract_task_info(message)
  if task.stopped?
    job_execution = find_execution_from_task_info(task)
    if job_execution
      job_execution.update!(status: task.success? ? :success : :failed, finished_at: task.stopped_at)
    end
  end
end

I have to implement such a feature in Hako first, then support it in Barbeque (and Kuroko2!).
Using SQS for polling is out of scope for this issue.

SPOILER: I'm writing a complete patch for this issue on this branch: https://github.com/eagletmt/barbeque/tree/runner-and-poller

I see. That totally makes sense for that part. Having the shuffle poller as the next step sounds like a reasonable decision.

Should the Runner enforce a maximum number of running executions?

If ECS cluster scale-in is properly implemented, unlimited scale-out wouldn't be a problem on the barbeque side. However, it could be problematic on the side of the executed application, in situations like the following:

  • An external API of some SaaS could be called excessively and get rate-limited
  • An internal API of another application could be called excessively and go down
  • An access spike could enqueue many jobs and put additional pressure on the database

While ECS scale-out isn't that fast, so such cases would rarely be problematic, it would still be better to think about a way to introduce a limit on running executions (per application?).

Unable to control the maximum number of concurrent job executions

As a casual and easy-to-implement (but not scalable) way, we can query the count of running executions every time for now. Another way would be to add a column to the applications table and manage the count there with single-query increments/decrements (if we don't want to manage Redis).
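
A sketch of the counter-column idea; the model and column names (Barbeque::Application, running_executions_count) are assumptions, and the conditional UPDATE keeps the check-and-increment atomic:

def try_acquire_slot(application, limit)
  # Returns true only if a slot was taken; the UPDATE is a single atomic query.
  Barbeque::Application
    .where(id: application.id)
    .where('running_executions_count < ?', limit)
    .update_all('running_executions_count = running_executions_count + 1') == 1
end

def release_slot(application)
  # Call when the execution finishes (success or failure).
  Barbeque::Application.where(id: application.id)
    .update_all('running_executions_count = running_executions_count - 1')
end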

An access spike could enqueue many jobs and put additional pressure on the database

I'm especially concerned about this situation. Barbeque jobs can spike when end-user access spikes, which is often unpredictable for us. Running too many jobs concurrently could put excessive load on databases, cache stores, and other services.

I will try the easiest way of keeping Barbeque::JobExecution.running.count <= HARD_LIMIT.
Setting a per-application limit also looks like a nice idea. I will implement it if our environment needs it.

+1 for "setting a per-application limit" for the concurrency problems.

After taking a short glance at this and #38, is the "latency issue for monitoring" already solved? It's related to this sentence:

In that case, the running-jobs metric will be incorrect and harmful for monitoring, and it will make it hard to correctly introduce auto-scaling (on the barbeque-worker side, not seriously considered yet) based on that metric.

Will this problem be solved by getting the actual finish time? I mean, can we ignore the latency between the moment when the job actually finishes and the moment when the poller gets the actual finish time?

Aha, or the problem might be out of scope for this issue:

Using SQS for polling is out of scope for this issue.

can we ignore the latency between the moment when the job actually finishes and the moment when the poller gets the actual finish time?

Yes. As I answered in #32 (comment), the current implementation of executing hako oneshot foo.yml also has such latency, because hako oneshot itself does polling.
The latency might be harmful, but my patch doesn't make it worse.

Implemented in #38