Improve cloud scheduler job alerts.
mariliamelo opened this issue
TL;DR
We should have alerts that consistently track the forward progress of our cloud scheduler jobs.
Design
We want to make sure the system alerts if the following jobs aren't executing as expected.
E.g.:
JWKS - imports at least once every X minutes
database backup - successfully runs at least once every 12 hours
Mike to define the requirements for the other jobs.
Here is the list of cloud scheduler jobs we have:
Job | Frequency | When to be alerted | Page (Y/N) |
---|---|---|---|
appsync-worker | Once a day | TBD | TBD |
backup-database-worker | Every 6 hours | TBD | TBD |
cleanup-worker | Every hour | TBD | TBD |
docker-mirror-worker | Once a day | TBD | TBD |
e2e-default-workflow | Every 5-10 mins | TBD | Yes |
e2e-revise-workflow | Every 5-10 mins | TBD | Yes |
e2e-enx-redirect-workflow | Every 5-10 mins | TBD | Yes |
modeler-worker | Once a day | TBD | TBD |
realm-key-rotation-worker | Every 30 mins | TBD | TBD |
rotation-worker | Every 5 mins | TBD | TBD |
stats-puller-worker | Every 30 mins | TBD | TBD |
Every job will be retried 3 times.
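For reference, the retry policy is configured per job on the scheduler resource itself. A minimal Terraform sketch (the job name, schedule, and target URI here are illustrative placeholders, not our actual config):

```hcl
// Sketch only: names, schedule, and URI are hypothetical.
resource "google_cloud_scheduler_job" "rotation" {
  name      = "rotation-worker"
  schedule  = "*/5 * * * *" // every 5 minutes
  time_zone = "UTC"

  retry_config {
    retry_count = 3 // matches the "retried 3 times" policy above
  }

  http_target {
    http_method = "POST"
    uri         = "https://example.com/rotate" // placeholder endpoint
  }
}
```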
I'm separating the "when to be alerted" and "page" columns in case we want to file a ticket instead of paging someone for a given job's failure. We can also customize it to alert only when a page is needed.
My quick suggestions:
- appsync-worker - run 4x a day. Alert if it fails twice in a row.
- backup-database-worker - alert on all failures and page; the playbook should have us run a manual backup
- cleanup-worker - alert if there is no success for 6 consecutive hours
- docker-mirror-worker - email on failures, do not page
- e2e - page on each failure, but not if the scheduler fails to reach the container
- modeler - run 4x a day, alert if it fails twice in a row
- realm-key-rotation-worker - change to every 15 minutes, alert if it fails twice in a row
- rotation-worker - change to every 30 minutes, alert if it fails twice in a row
- stats-puller - change to 3x an hour (10, 20, 30 after the hour). Alert if it fails 3x in a row
In general, I think "fails twice in a row" is going to be harder to track than "didn't succeed in X minutes".
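For what it's worth, "didn't succeed in X minutes" maps naturally onto an MQL absence condition over the scheduler's log-based metric. A sketch (the job ID and the 30m window are placeholders; this assumes the `severity` label distinguishes successful runs):

```
fetch cloud_scheduler_job::logging.googleapis.com/log_entry_count
| filter resource.job_id == 'rotation' && metric.severity != 'ERROR'
| absent_for 30m
```

This fires when no non-error log entry has arrived for the window, regardless of how many individual attempts failed in between.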
> docker-mirror-worker - email on failures, do not page.
We can drop this service once the migration to GitHub Actions is complete
The following query assumes that the log entries of a specific cron job arrive at a 1m interval, and it gives you the sliding error ratio, where the sliding window is 3 runs (3 minutes here). The output period is still 1m.
```
fetch cloud_scheduler_job::logging.googleapis.com/log_entry_count
| align delta(1m)
| filter resource.job_id == '<your_job_id>'
| { filter metric.severity == 'ERROR'
  | group_by sliding(3m), .count
  | group_by [], .sum
  ; ident
  | group_by sliding(3m), .count
  | group_by [], .sum
  }
| outer_join [0]
| div
| every 1m
| within 1h
```
In chat, you asked about creating alerts for individual jobs. We can do this using Terraform and some interpolation. For example (untested):
```hcl
variable "cloud_scheduler_job_alerts" {
  // for_each needs a map (or a set of strings), so key by job ID
  type = map(object({
    window = string
  }))
  default = {
    rotation = { window = "3m" }
    // ...
  }
}
```
Then create the alert like:
```hcl
resource "google_monitoring_alert_policy" "cloud_scheduler" {
  for_each = var.cloud_scheduler_job_alerts

  display_name = each.key
  combiner     = "OR"
  // ...

  conditions {
    display_name = "${each.key} forward progress"
    condition_monitoring_query_language {
      duration = "0s"
      query    = <<-EOT
        fetch ... // use ${each.value.window} to interpolate in here
      EOT
    }
  }
}
```
All background jobs have been switched to forward-progress alerting. This is done.