SLA / service outage

Question

hannesm opened this issue 10 months ago

Dear Madam or Sir,

Today it looks like the web service is available, but nothing has happened for the last hour, i.e. https://ocaml.ci.dev/github/hannesm/mirage-crypto/commit/0573d45d5eb6f119624864843cc48af2ded9eb5c/variant/%28analysis%29 is still in "analysis" / "0s in queue".

Since I asked the other day what the status of this service is, and OCaml-CI is considered to be stable, I wanted to ask whether there's some dashboard / status page about service interruptions?

Best,

Hannes

Hannes Mehnert · Answer 1 · Mon Sep 18 2023 22:55:12 GMT+0800 (China Standard Time)

NB: just when I opened this issue, the analysis job started to make progress. So please ignore the first paragraph, while the second still holds.

Riku Silvola · Answer 2 · Wed Sep 20 2023 14:47:42 GMT+0800 (China Standard Time)

Hello @hannesm !

A public status page doesn't currently exist, though for significant outages we do post on the infra blog.

Hannes Mehnert · Answer 3 · Thu Oct 05 2023 20:24:42 GMT+0800 (China Standard Time)

Thanks for your comment @rikusilvola. Since yesterday afternoon, there has again been first an outage, and now temporary failures.

I'm still wondering what Service Level you intend to deliver. Which "significant outages" get posted to the "infra blog"?

Riku Silvola · Answer 4 · Fri Oct 06 2023 16:40:25 GMT+0800 (China Standard Time)

Indeed, OCaml-CI experienced several minor outages in the past few days. With increased load, the service became unresponsive but was recovered within a couple of hours each time. Initial investigations point to Lwt starvation leading to the web interface getting stuck.

The services are provided with best-effort support, meaning that once an issue is noticed, it is treated during business hours according to its relative criticality. Most of the time, what is perceived as an outage is a reduced quality of service due to a temporary spike in activity. These outages are commonly transient, and the service is restored without human intervention.

Here are some examples of posts for significant outages

I welcome you to report any outage you experience on ocaml/infrastructure.

Hannes Mehnert · Answer 5 · Sat Oct 07 2023 20:49:45 GMT+0800 (China Standard Time)

Thanks for your reply. What I understand (please correct me if I'm wrong) is that "during office hours [unclear where], the service is maintained as we see fit [with some priority]". There's no SLA, and human intervention is required for restarting / restoring the service when there is a spike in activity.

"Most of the time, what is perceived as an outage is a reduced quality of service"

You mean the 500 (internal server error) responses I get at the moment are "reduced quality of service"?

In any case, thanks for providing the free service. I'll close my issues and hope you'll eventually find the time and energy to set up monitoring and improve reliability.