google / exposure-notifications-verification-server

Verification component for COVID-19 Exposure Notifications.

Improve cloud scheduler job alerts.

mariliamelo opened this issue · comments

TL;DR

We should have alerts that consistently track the forward progress of our cloud scheduler jobs.

Design

We want to make sure the system alerts if the following jobs aren't executing as expected.

E.g.:

  • JWKS - imports at least once every X minutes
  • database backup - successfully runs at least once every 12 hours

Mike to define the requirements for the other jobs.

Here is the list of cloud scheduler jobs we have:

Job                        Frequency        When to be alerted  Page (Y/N)
appsync-worker             Once a day       TBD                 TBD
backup-database-worker     Every 6 hours    TBD                 TBD
cleanup-worker             Every hour       TBD                 TBD
docker-mirror-worker       Once a day       TBD                 TBD
e2e-default-workflow       Every 5-10 mins  TBD                 Yes
e2e-revise-workflow        Every 5-10 mins  TBD                 Yes
e2e-enx-redirect-workflow  Every 5-10 mins  TBD                 Yes
modeler-worker             Once a day       TBD                 TBD
realm-key-rotation-worker  Every 30 mins    TBD                 TBD
rotation-worker            Every 5 mins     TBD                 TBD
stats-puller-worker        Every 30 mins    TBD                 TBD

Every job will be retried 3 times.

I'm separating the "when to be alerted" and "page" columns in case we want to file a ticket instead of paging someone for a given job failure. We can also customize it so we only alert when a page is actually needed.

My quick suggestions:

  • appsync-worker - run 4x a day. Alert if it fails twice in a row.
  • backup-database-worker - alert on all failures and page; the playbook should have us run a manual backup.
  • cleanup-worker - alert if there is no success for 6 consecutive hours.
  • docker-mirror-worker - email on failures, do not page.
  • e2e - page on each failure, but not if the scheduler fails to reach the container.
  • modeler - run 4x a day, alert if it fails twice in a row.
  • realm-key-rotation-worker - change to every 15 minutes, alert if it fails twice in a row.
  • rotation-worker - change to every 30 minutes, alert if it fails twice in a row.
  • stats-puller - change to 3x an hour - 10, 20, 30 after the hour. Alert if it fails 3x in a row.

In general, I think "fails twice in a row" is going to be harder to track than "didn't succeed in X minutes".

docker-mirror-worker - email on failures, do not page.

We can drop this service once the migration to GitHub Actions is complete.

The following query assumes that log entries for a specific cron job arrive at a 1-minute interval, and it gives you the sliding failure ratio (ERROR entries divided by all entries), where the sliding window is 3 runs (3 minutes here). The output period is still 1m.

  fetch cloud_scheduler_job::logging.googleapis.com/log_entry_count
  | align delta(1m)
  | filter resource.job_id == '<your_job_id>'
  | { filter metric.severity == 'ERROR'
      | group_by sliding(3m), .count
      | group_by [], .sum
    ; ident
      | group_by sliding(3m), .count
      | group_by [], .sum
    }
  | outer_join [0]
  | div
  | every 1m
  | within 1h

In chat, you asked about creating alerts for individual jobs. We can do this using Terraform and some interpolation. For example (untested):

variable "cloud_scheduler_job_alerts" {
  type = set(object({
    job_id = string
    window = number
  }))

  default = set([
    { job_id = "rotation", window = "3m" },
    // ...
  ])
}

Then create the alert like:

resource "google_monitoring_alert_policy" "cloud_scheduler" {
  for_each = var.cloud_scheduler_job_alerts

  display_name = each.job_id
  // ...

  conditions {
    condition_monitoring_query_language {
      query = <<EOT
        fetch ... // use ${each.window} to interpolate in here
      EOT
    }
  }
}
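
Note that for_each only accepts a map or a set of strings, which is why the example converts the set of objects into a map keyed by job_id; that also gives each alert policy a stable address in state. The // ... placeholders still need the provider's required fields (combiner on the policy, display_name and duration on the condition) before this will apply cleanly.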

All background jobs have been switched to forward-progress alerting. This is done.