ocurrent / ocaml-ci

A CI for OCaml projects

Home Page: https://ocaml.ci.dev

SLA / service outage

hannesm opened this issue

Dear Madam or Sir,

Today it looks like the web service is available, but nothing has been happening for the last hour -- i.e. https://ocaml.ci.dev/github/hannesm/mirage-crypto/commit/0573d45d5eb6f119624864843cc48af2ded9eb5c/variant/%28analysis%29 is still in "analysis" / "0s in queue".

Since I asked the other day what the status of this service is, and OCaml-CI is considered to be stable, I wanted to ask whether there's some dashboard / status page about service interruptions?

Best,

Hannes

NB: just when I opened this issue, the analysis job started to make progress. So please ignore the first paragraph, while the second still holds.

Hello @hannesm !

A public status page doesn't currently exist, though for significant outages we do post on the infra blog.

Thanks for your comment @rikusilvola. Since yesterday afternoon, there was first another outage, and now there are temporary failures.

I'm still wondering what Service Level you intend to deliver. What counts as a "significant outage" that gets posted to the "infra blog"?

Indeed, OCaml-CI experienced several minor outages in the past few days. Under increased load, the service became unresponsive but was recovered within a couple of hours each time. Initial investigations point to Lwt starvation causing the web interface to get stuck.

The services are provided with best-effort support, meaning that once an issue is noticed, it is treated during business hours according to its relative criticality. Most of the time, what is perceived as an outage is a reduced quality of service due to a temporary spike in activity. These outages are commonly transient, and the service is restored without human intervention.

Here are some examples of posts for significant outages

You are welcome to report any outage you experience on ocaml/infrastructure.

Thanks for your reply. What I understand (please correct me if I'm wrong) is that "during office hours [unclear where], the service is maintained as we see fit [with some priority]". There's no SLA, and human intervention is required for restarting / restoring the service when there is a spike in activity.

"Most of the time, what is perceived as an outage is a reduced quality of service"

You mean the 500 (internal server error) responses I get at the moment are "reduced quality of service"?

In any case, thanks for providing the free service. I'll close my issues and hope you'll eventually find the time and energy to set up monitoring and improve reliability.