Service Level Indicators (SLI) define the metrics we use to verify we are fulfilling theService Level Objectives (SLO)
- Example of a SLI related to "monthly uptime":
- Web service available over 99.5% of time during the last month
- Example of a SLI related to "request response time":
- REST API average response time below 100ms
- Latency: the time taken to serve a request (usually measured in ms)
- Traffic: the amount of stress on a system from demand (e.g. the number of HTTP requests/second)
- Errors: the number of requests that are failing (e.g. number of HTTP 50x responses)
- Saturation: the overall capacity of a service (e.g. the percentage of memory or CPU used)
- Uptime: the percentage of time of a hosting platform or a webservice has been operational and available
TROUBLE TICKET
Name: Bug found in frontend leads to error 404
Date: 04.12.2021
Subject: There is a bug in the frontend which calls a non existing API in the backend. This is causing high generation of 404 (not found)
Affected Area: Homepage
Severity: Medium
Description: When the user clicks on a button, the frontend is calling a non-existing API in the backend, which generates a 404 error
- Uptime of services in percentage
- CPU and memory usage in percentage
- Percentage of HTTP status code is 20x
- Average response time
- Service uptime of services must be >99.99%. This is above the 99.95% objective and therefore guaranteeing that we will be compliant.
- CPU usage must be < 80%. Getting CPU beyond this could be a problem of saturation. We should in such case scale out increasing the number of replicas.
- Memory usage must be < 80%. Getting RAM beyond this could be a problem of saturation. We should in such case scale out increasing the number of replicas.
- Average response time of HTTP requests must be < 100ms. Longer response times may be perceived as if the system is not available, pontentially leading to timeout issues.
The first screenshot shows the dashboard including the panels for Memory usage, CPU usage and the response time of requests. Ideally, we would generate the logic to automatically trigger an alarm when the desired values are surpassed.
The second screenshot shows the last panel at the bottom, which represents the up time. This can be used to verify the system is up and running. It would be needed to calculate somewhere the percentage of time the system has been up, and trigger an alarm when the system is down and could not recover after some time (e.g. 120 seconds).