Hook to Cerberus to understand the recovery time
chaitanyaenr opened this issue · comments
When running chaos scenarios, it's important to understand how long the cluster as a whole takes to become healthy again after failure injection, so that we can find areas to improve. Today, Kraken queries Cerberus for a pass/fail signal but doesn't track how long recovery takes; we are tracking the Cerberus metrics manually to work that out.
It would be nice to have Kraken poll Cerberus after each chaos scenario, with a timeout, and dump a JSON file with the recovery timing.
For example: track the time taken by the cluster to recover post zone outage.
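A rough sketch of what this could look like, to make the proposal concrete. It assumes Cerberus exposes its go/no-go signal over HTTP as the plain text `True` or `False`. The function names (`cerberus_is_healthy`, `measure_recovery`, `write_report`), the JSON field names, and the default timeout and interval are all hypothetical and are not existing Kraken code.

```python
import json
import time
import urllib.request


def cerberus_is_healthy(url, request_timeout=5):
    """Return True if the (assumed) Cerberus signal endpoint answers 'True'."""
    try:
        with urllib.request.urlopen(url, timeout=request_timeout) as resp:
            return resp.read().decode().strip() == "True"
    except OSError:
        # Connection errors and timeouts count as "not healthy yet".
        return False


def measure_recovery(check, scenario, timeout=600, interval=5,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True or `timeout` seconds have elapsed.

    `check` is any zero-argument callable returning a bool, for example
    lambda: cerberus_is_healthy(url). `clock` and `sleep` are injectable
    so the loop can be exercised without real waiting.
    """
    start = clock()
    while True:
        elapsed = clock() - start
        if check():
            return {
                "scenario": scenario,
                "recovered": True,
                "recovery_time_seconds": round(elapsed, 2),
                "timeout_seconds": timeout,
            }
        if elapsed >= timeout:
            # The cluster did not report healthy within the allowed window.
            return {
                "scenario": scenario,
                "recovered": False,
                "recovery_time_seconds": None,
                "timeout_seconds": timeout,
            }
        sleep(interval)


def write_report(result, path):
    """Dump the timing result as JSON."""
    with open(path, "w") as f:
        json.dump(result, f, indent=2)
```

Kraken would call something like `measure_recovery(lambda: cerberus_is_healthy(cerberus_url), "zone_outage")` right after injecting the failure and pass the result to `write_report`. The measured time is only as precise as the polling interval, and it starts when polling begins rather than when the failure was injected.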
@paigerube14 @yogananth-subramanian thoughts?