google / exposure-notifications-verification-server

Verification component for COVID-19 Exposure Notifications.

Improve cloud scheduler job alerts.

mariliamelo opened this issue · comments

TL;DR

We should have alerts that consistently track the forward progress of our cloud scheduler jobs.

Design

We want to make sure the system alerts if the following jobs aren't executing as expected.

E.g.:

  • JWKS - imports at least once every X minutes
  • database backup - successfully runs at least once every 12 hours

Mike to define the requirements for the other jobs.

Here is the list of cloud scheduler jobs we have:

Job                        Frequency        When to be alerted  Page (Y/N)
appsync-worker             Once a day       TBD                 TBD
backup-database-worker     Every 6 hours    TBD                 TBD
cleanup-worker             Every hour       TBD                 TBD
docker-mirror-worker       Once a day       TBD                 TBD
e2e-default-workflow       Every 5-10 mins  TBD                 Yes
e2e-revise-workflow        Every 5-10 mins  TBD                 Yes
e2e-enx-redirect-workflow  Every 5-10 mins  TBD                 Yes
modeler-worker             Once a day       TBD                 TBD
realm-key-rotation-worker  Every 30 mins    TBD                 TBD
rotation-worker            Every 5 mins     TBD                 TBD
stats-puller-worker        Every 30 mins    TBD                 TBD

Every job will be retried 3 times.

I'm separating the "when to be alerted" and "page" columns in case we want to file a ticket instead of paging someone for a given job failure. We can also customize it so we only alert when a page is actually needed.

My quick suggestions:

  • appsync-worker - run 4x a day. Alert if it fails twice in a row.
  • backup-database-worker - alert on all failures and page; the playbook should have us run a manual backup.
  • cleanup-worker - alert if there is no success for 6 consecutive hours.
  • docker-mirror-worker - email on failures, do not page.
  • e2e - page on each failure, but not if the scheduler fails to reach the container.
  • modeler - run 4x a day, alert if it fails twice in a row.
  • realm-key-rotation-worker - change to every 15 minutes, alert if it fails twice in a row.
  • rotation-worker - change to every 30 minutes, alert if it fails twice in a row.
  • stats-puller - change to 3x an hour - 10, 20, 30 after the hour. Alert if it fails 3x in a row.

In general, I think "fails twice in a row" is going to be harder to track than "didn't succeed in X minutes".

docker-mirror-worker - email on failures, do not page.

We can drop this service once the migration to GitHub Actions is complete.

The following query assumes that log entries for a specific cron job arrive at a 1-minute interval, and it gives you the sliding failure ratio (ERROR entries divided by all entries), where the sliding window is 3 runs (3 minutes here). The output period is still 1m.

  fetch cloud_scheduler_job::logging.googleapis.com/log_entry_count
  | align delta(1m)
  | filter resource.job_id == '<your_job_id>'
  | { filter metric.severity == 'ERROR'
      | group_by sliding(3m), .count
      | group_by [], .sum
    ; ident
      | group_by sliding(3m), .count
      | group_by [], .sum
    }
  | outer_join [0]
  | div
  | every 1m
  | within 1h

In chat, you asked about creating alerts for individual jobs. We can do this using Terraform and some interpolation. For example (untested):

variable "cloud_scheduler_job_alerts" {
  type = set(object({
    job_id = string
    window = number
  }))

  default = set([
    { job_id = "rotation", window = "3m" },
    // ...
  ])
}

Then create the alert like:

resource "google_monitoring_alert_policy" "cloud_scheduler" {
  for_each = var.cloud_scheduler_job_alerts

  display_name = each.job_id
  // ...

  conditions {
    condition_monitoring_query_language {
      query = <<EOT
        fetch ... // use ${each.window} to interpolate in here
      EOT
    }
  }
}
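
Note that for_each only accepts a map or a set of strings, which is why the example converts the set of objects into a map keyed by job_id; that also gives each alert policy a stable address in state. The // ... placeholders still need the provider's required fields (combiner on the policy, display_name and duration on the condition) before this will apply cleanly.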

All background jobs have been switched to forward-progress alerting. This is done.