DB down did not generate any alert

Question

grpandurangipwc opened this issue 4 years ago

We brought the database down in encv-staging for more than 30 minutes.
No alert was generated, even though the application was unavailable.
We were unable to log in because authorization was not working. The E2E testing alert policy should have fired.
Kindly suggest.

Marilia Melo · Answer 1 · Tue Dec 15 2020 03:20:50 GMT+0800 (China Standard Time)

We don't recommend alerting on infrastructure failures. No alert fired here because there was no real traffic to the system. If traffic had been hitting the system, the availability SLO alert would have triggered, since the system wasn't available.

That said, we expect the e2e runner to catch this kind of error. Yuchen will tune the e2e error-rate alert and check why it did not trigger.
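
To illustrate why a ratio-based availability alert stays silent without traffic, here is a minimal Go sketch. It is a simplified model, not how Cloud Monitoring evaluates SLOs, and the names `RequestCounts`, `ErrorRatio`, and `ShouldAlert` are hypothetical.

```go
package main

// RequestCounts holds request totals for one evaluation window.
// The type and its fields are hypothetical, for illustration only.
type RequestCounts struct {
	Total  int64 // all requests observed in the window
	Failed int64 // requests that ended in a server error
}

// ErrorRatio returns the fraction of failed requests and whether the ratio
// is defined. With zero requests there is nothing to divide by, so ok is false.
func ErrorRatio(c RequestCounts) (ratio float64, ok bool) {
	if c.Total == 0 {
		return 0, false
	}
	return float64(c.Failed) / float64(c.Total), true
}

// ShouldAlert reports whether a ratio-based availability alert would fire.
// A window with no requests never fires, however broken the backend is.
func ShouldAlert(c RequestCounts, threshold float64) bool {
	ratio, ok := ErrorRatio(c)
	return ok && ratio > threshold
}
```

Under this model, a staging environment with a dead database and no callers produces an undefined ratio, so the alert has nothing to compare against the threshold.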

Marilia Melo · Answer 2 · Tue Dec 15 2020 03:21:04 GMT+0800 (China Standard Time)

/assign yegle

Yuchen Ying · Answer 3 · Tue Dec 15 2020 05:26:33 GMT+0800 (China Standard Time)

@grpandurangipwc What's the time range when you took down the database? I'll try to find some logs around it.

grpandurangipwc · Answer 4 · Tue Dec 15 2020 10:38:04 GMT+0800 (China Standard Time)

@yegle The PSQL DB was down from 6:15 am ET to 10:20 am on 14 Dec 2020.

grpandurangipwc · Answer 5 · Tue Dec 15 2020 10:42:27 GMT+0800 (China Standard Time)

We don't recommend alerting on infrastructure failures. No alert fired here because there was no real traffic to the system. If traffic had been hitting the system, the availability SLO alert would have triggered, since the system wasn't available.

That said, we expect the e2e runner to catch this kind of error. Yuchen will tune the e2e error-rate alert and check why it did not trigger.

@mariliamelo There was no real traffic because it is a development environment. We also know that when the DB is down, nobody can log in (authorization fails) or issue codes. A DB outage has a huge impact on the availability of the application. We need an alert when users cannot log in to MVS or issue verification codes for more than 5-10 minutes, even though the URL and Cloud Run services are up.

Yuchen Ying · Answer 6 · Tue Dec 15 2020 10:56:42 GMT+0800 (China Standard Time)

Per the out-of-band chat, we confirmed that the only traffic was uptime-check HTTP requests to the /health handler, and they all returned 200 OK.

Note that we currently only have the Availability SLO alert on the apiserver service. The e2e-runner does hit apiserver, but I checked its log and it looks like it was failing before reaching the code that would hit apiserver, presumably because the database was down.

So:

1. It's WAI (working as intended) in this case that the Availability SLO alert did not trigger. This can be fixed by adding the Availability SLO alert to all Cloud Run services. I've submitted a PR to do that.
2. It should've triggered the E2ETestFailure alert. I'll need to investigate why it did not alert.

grpandurangipwc · Answer 7 · Tue Dec 15 2020 12:38:56 GMT+0800 (China Standard Time)

@yegle Kindly let us know if you need any info from us. From the E2E runner logs, we could see that codes and long-term tokens were not issued and TEKs were not published. All of those steps errored.
We need to check how this is treated as an outlier in Metrics Explorer. When we tried to reproduce the alert query in Metrics Explorer, we could not see any deviations.

github-actions · Answer 8 · Tue Dec 29 2020 20:05:17 GMT+0800 (China Standard Time)

This issue is stale because it has been open for 14 days with no
activity. It will automatically close after 7 more days of inactivity.