lyft / clutch

Extensible platform for infrastructure management

Home Page: https://clutch.sh

chaos_experimentation: scheduling of experiments

kathan24 opened this issue

Description
A way to schedule recurring experiments would allow us to run experiments 24x7. Currently, the status of an experiment is derived from its start and end dates. If you set the start date to a future date, the experiment's status changes at that date and time. However, this does not allow us to run recurring experiments.

One high-level approach to making this feature available is outlined below.
Prerequisite: the asynchronous task support feature

Plan

  • Start persisting the status of experiments in the Postgres database.
  • Add a new status called STATUS_SCHEDULED.
  • Create a new task using the asynchronous task support. This task will look for all experiments with status == STATUS_SCHEDULED and start_time < current_time and update their status to STATUS_RUNNING. The same task, or possibly a new one, will terminate an experiment by setting its status to STATUS_COMPLETED.
  • There will be knobs in the UI to select how often the experiment recurs.
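
The status transitions described in the plan can be sketched roughly as follows. This is only an illustration in Python, not Clutch code: the `Experiment` class, the `tick` function, and the in-memory list standing in for the Postgres table are all assumptions made for the example. The only names taken from the proposal are the three status values and the `start_time` / `end_time` fields.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class Status(Enum):
    STATUS_SCHEDULED = "STATUS_SCHEDULED"
    STATUS_RUNNING = "STATUS_RUNNING"
    STATUS_COMPLETED = "STATUS_COMPLETED"


@dataclass
class Experiment:
    # Hypothetical record; in the proposal this would be a row in Postgres.
    id: int
    start_time: datetime
    end_time: datetime
    status: Status = Status.STATUS_SCHEDULED


def tick(experiments, now):
    """One pass of the hypothetical periodic task; mutates statuses in place."""
    for exp in experiments:
        # Promote scheduled experiments whose start time has passed.
        if exp.status is Status.STATUS_SCHEDULED and exp.start_time < now:
            exp.status = Status.STATUS_RUNNING
        # Terminate running experiments whose end time has been reached.
        # A plain `if` (not `elif`) lets a fully elapsed window complete
        # within the same pass.
        if exp.status is Status.STATUS_RUNNING and exp.end_time <= now:
            exp.status = Status.STATUS_COMPLETED


if __name__ == "__main__":
    start = datetime(2021, 1, 1, 12, 0)
    exp = Experiment(id=1, start_time=start, end_time=start + timedelta(hours=1))
    tick([exp], now=start + timedelta(minutes=5))
    print(exp.status.name)  # STATUS_RUNNING
```

The sketch compares `end_time <= now` rather than testing for equality, since a task that runs at intervals is unlikely to observe the exact end instant.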

Flow

  • When someone creates an experiment from the UI, the status will be set to STATUS_SCHEDULED.
  • Once the start date of the experiment becomes current, change the status to STATUS_RUNNING.
  • The xDS server will pick up the experiments whose status is STATUS_RUNNING and inject faults.
  • The async task will then terminate the experiment once end_time is reached (end_time <= current_time) and will schedule the next experiment.
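
The last step of the flow, terminating an experiment and scheduling its next occurrence, might look something like the sketch below. The proposal does not specify how recurrence is stored, so the `repeat_every` field, the `ScheduledExperiment` class, and the `complete_and_reschedule` function are hypothetical. The sketch also assumes the next occurrence simply shifts the whole window forward by one interval.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta
from typing import Optional, Tuple


@dataclass(frozen=True)
class ScheduledExperiment:
    start_time: datetime
    end_time: datetime
    status: str = "STATUS_SCHEDULED"
    # Hypothetical field: gap between consecutive start times.
    # None means the experiment runs once.
    repeat_every: Optional[timedelta] = None


def complete_and_reschedule(
    exp: ScheduledExperiment, now: datetime
) -> Tuple[ScheduledExperiment, Optional[ScheduledExperiment]]:
    """Return (experiment, next_occurrence).

    The experiment comes back unchanged if it is not running or has not
    ended yet. next_occurrence is None in that case and for one-off
    experiments.
    """
    if exp.status != "STATUS_RUNNING" or exp.end_time > now:
        return exp, None

    finished = replace(exp, status="STATUS_COMPLETED")
    if exp.repeat_every is None:
        return finished, None

    # Assumption: the next run keeps the same duration and is shifted
    # forward by exactly one recurrence interval.
    next_occurrence = replace(
        exp,
        start_time=exp.start_time + exp.repeat_every,
        end_time=exp.end_time + exp.repeat_every,
        status="STATUS_SCHEDULED",
    )
    return finished, next_occurrence
```

Creating a fresh STATUS_SCHEDULED record per occurrence, rather than resetting the completed one, keeps a history of past runs. Whether that matches the intended design is an open question.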

These details need to be fleshed out and
Complexity [S/M/L]: L