google / exposure-notifications-verification-server

Verification component for COVID-19 Exposure Notifications.

DB down did not trigger any alert

grpandurangipwc opened this issue · comments

We brought the DB down in encv-staging for more than 30 minutes.
It did not generate any alert, even though the application was unavailable.
We were unable to log in because authorization was not working. The E2E testing alert policy should have fired.
Kindly suggest.

We don't recommend alerting on infrastructure failures. No alert fired here because there was no real traffic to the system. Had there been traffic hitting the system, the availability SLO alert would have been triggered, since the system wouldn't have been available.

That said, we expect the e2e runner to hit this kind of error. Yuchen will be tuning the e2e error rate to see why this wasn't triggered.
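The point about traffic can be sketched as follows. This is a minimal illustration, assuming the availability alert compares the ratio of failed to total requests against a threshold. The function names and the 5% threshold are hypothetical and are not taken from the project's alert policy.

```go
package main

import "fmt"

// errorRatio returns the fraction of failed requests in a window.
// With zero requests there is nothing to measure, so ok is false.
func errorRatio(failed, total int) (ratio float64, ok bool) {
	if total == 0 {
		return 0, false
	}
	return float64(failed) / float64(total), true
}

// shouldAlert fires only when there is traffic and the error ratio
// exceeds the threshold. The threshold is an assumption for illustration.
func shouldAlert(failed, total int, threshold float64) bool {
	ratio, ok := errorRatio(failed, total)
	return ok && ratio > threshold
}

func main() {
	// Database down, no traffic: nothing to measure, so no alert.
	fmt.Println(shouldAlert(0, 0, 0.05))
	// Database down, real traffic: every request fails, so the alert fires.
	fmt.Println(shouldAlert(120, 120, 0.05))
}
```

Under these assumptions, an outage in an environment with no requests leaves the ratio undefined, which is why a request-based SLO alert stays silent.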

/assign yegle

@grpandurangipwc What's the time range when you took down the database? I'll try to find some logs around it.

@yegle The PSQL DB was down from 6:15 am ET to 10:20 am on 14th Dec, 2020.

@mariliamelo There was no real traffic because it was a development environment. We also know that when the DB is down, we cannot log in (authorization does not work) or issue codes. A DB outage has a huge impact on the availability of the application. We need an alert when users have been unable to log in to MVS or issue verification codes for more than 5-10 minutes, even though the URL and Cloud Run services are up.

Per the out-of-band chat, we confirmed that the only traffic was uptime check HTTP requests to the /health handler, and they all returned 200 OK.

Note that currently we only have an Availability SLO alert on the apiserver service. The e2e-runner does hit apiserver, but I checked its log and it looks like it was failing before reaching the code that would hit apiserver, presumably because the database was down.

So

  1. It's WAI (working as intended) that the Availability SLO alert did not trigger in this case. This can be fixed by adding an Availability SLO alert to all Cloud Run services. I've submitted a PR to do that.
  2. It should've triggered the E2ETestFailure alert. I'll need to investigate why it did not alert.

@yegle Kindly let us know if you need any info from us. From the e2e-runner logs, we could see that codes and long-term tokens were not issued and TEKs were not published. All of those steps errored.
How this gets treated as an outlier in Metrics Explorer needs to be checked. When we tried to reproduce the alert query in Metrics Explorer, we could not see any deviations.

This issue is stale because it has been open for 14 days with no
activity. It will automatically close after 7 more days of inactivity.