lyft / clutch

Extensible platform for infrastructure management

Home Page: https://clutch.sh

chaos_experimentation: real-time stats

kathan24 opened this issue

Description
To run aggressive experiments, we need a tight, metric-driven feedback loop. This will let us terminate an experiment quickly, before it affects our drivers and passengers.

One way to get real-time, per-second stats from the service is to use Envoy Proxy's Load Reporting Service (LRS) API. The API reports request metrics from which the service's success rate can be determined. Since the chaos experimentation framework already comes with an xDS server, we should host the LRS server on it as well. We will also need a time series database (TSDB) to aggregate the stats from each host of the service.
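As a rough illustration of the feedback loop described above, here is a minimal sketch of how per-host reports for one interval could be aggregated into a service-wide success rate and compared against a termination threshold. Everything in it is an assumption for illustration: `HostReport`, `success_rate`, `should_terminate`, and the `min_success_rate` default are hypothetical names and values, not part of Clutch or Envoy. The sketch also keeps the reports in memory, whereas the proposal is to aggregate them in a TSDB.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class HostReport:
    """Hypothetical, simplified view of one host's load report for one interval."""
    host: str
    successful_requests: int
    error_requests: int


def success_rate(reports: Iterable[HostReport]) -> Optional[float]:
    """Aggregate reports from all hosts into one success rate.

    Returns None when no requests finished in the interval, so that a quiet
    interval is not mistaken for a failing one.
    """
    ok = 0
    err = 0
    for report in reports:
        ok += report.successful_requests
        err += report.error_requests
    total = ok + err
    if total == 0:
        return None
    return ok / total


def should_terminate(reports: Iterable[HostReport],
                     min_success_rate: float = 0.99) -> bool:
    """Return True when the aggregated success rate is below the threshold.

    The 0.99 default is an illustrative assumption, not a recommended value.
    """
    rate = success_rate(reports)
    return rate is not None and rate < min_success_rate
```

For example, `should_terminate([HostReport("host-a", 98, 2), HostReport("host-b", 95, 5)])` aggregates 193 successes out of 200 finished requests and signals that the experiment should stop, since that rate is below the assumed 0.99 threshold.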

Complexity [S/M/L]: L