chaostoolkit / chaostoolkit

Chaos Engineering Toolkit & Orchestration for Developers

Home Page: https://chaostoolkit.org


Add a mechanism to protect the system from running the experiment

Lawouach opened this issue

As part of a recent discussion, it has been requested we add a new element to the experiment language.

An experiment file declares the blocks that define the experiment protocol: hypothesis, method and rollbacks/remediation. However, as experiments get run in production, some operational concerns come to light. One of them is the ability to interrupt an execution based on system decisions made outside the experiment flow itself. As suggested in the other thread, this could be the case if your production system is under attack or is going through issues of any sort.

Right now, Chaos Toolkit has no built-in mechanism to ask whether it should continue running. It's up to the operator to get that information out of band and interrupt the chaos process from the outside with a signal.

Throughout the discussion, I was initially not in favor of adding such a construct to the language itself when there are alternatives, such as the one we just mentioned (signals) or controls, which are meant for these orthogonal operational concerns.

But I finally appreciated the context here. To go into a production system, an experiment has to prove it will be a good citizen. To achieve that, it is indeed a good idea to expose, in the experiment, a first-class block that says "this is how I ensure I play nice with the system".

To that end, I suggest something similar to the original thread: a new element called safeguards. This element shall be a sequence of probes, each with a tolerance. The point made by Alexander about reusing known building blocks is spot on. These probes would ask the system "can I carry on?" and, if the system says "no", the experiment should interrupt itself as soon as technically possible.

This is therefore the specification proposal:

Safeguards

The Safeguards element is OPTIONAL. It describes when the experiment MUST be interrupted as soon as possible. 

The Safeguards element is a JSON array of Probe elements.

Each Probe MUST define a tolerance property that acts as a gate mechanism for the experiment to carry on or terminate as soon as possible. Any Probe that does not fall into the tolerance zone MUST interrupt the experiment.

Safeguards MAY declare controls.

Safeguards Probes MUST be executed at least once during the experiment.
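
To make the rules of this proposal concrete, here is a minimal Python sketch of how a loader might check the element. The function name validate_safeguards and the error messages are hypothetical and not part of Chaos Toolkit; only the "safeguards", "type" and "tolerance" keys come from the proposal.

```python
from typing import Any, Dict, List


def validate_safeguards(experiment: Dict[str, Any]) -> List[str]:
    """Return the problems found in the optional "safeguards" element.

    Hypothetical helper: it only mirrors the rules of the proposal above.
    """
    errors: List[str] = []

    # The element is OPTIONAL, so its absence is valid.
    if "safeguards" not in experiment:
        return errors

    safeguards = experiment["safeguards"]
    if not isinstance(safeguards, list):
        return ["safeguards must be a JSON array of probes"]

    for index, probe in enumerate(safeguards):
        if not isinstance(probe, dict) or probe.get("type") != "probe":
            errors.append(f"safeguards[{index}] must be a probe")
            continue
        # Each probe MUST define a tolerance: it is the gate that decides
        # whether the experiment carries on or is interrupted.
        if "tolerance" not in probe:
            errors.append(f"safeguards[{index}] must define a tolerance")

    return errors
```

An experiment without the element, or with probes that all carry a tolerance, yields an empty list.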

In addition, the Chaos Toolkit must accommodate this new element. It is suggested that the probes run in the background during the experiment at a specified frequency. Because the safeguards element can declare controls, the probes can be manipulated at runtime the same way other elements can. This is mostly useful to disable or change the behavior of a safeguard probe at runtime.

Finally, the chaos run command will grow new flags to define the runtime strategy of the probes: at what frequency should they run, and must all of them fail to interrupt the experiment, or is a single failing probe enough?
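
The "all or just one" question can be sketched as a small decision function. This is only an illustration: the function name should_interrupt and the strategy values "any" and "all" are assumptions, since the actual flag names had not been decided at this point.

```python
from typing import Iterable


def should_interrupt(within_tolerance: Iterable[bool], strategy: str = "any") -> bool:
    """Decide whether the experiment must be interrupted.

    `within_tolerance` holds one boolean per safeguard probe: True when the
    probe's result fell inside its tolerance. The strategy names are
    hypothetical.
    """
    failures = [not ok for ok in within_tolerance]
    if not failures:
        # No safeguards declared: nothing can ask for an interruption.
        return False
    if strategy == "any":
        return any(failures)
    if strategy == "all":
        return all(failures)
    raise ValueError(f"unknown strategy: {strategy!r}")
```

With results [True, False], the "any" strategy interrupts while the "all" strategy lets the experiment carry on.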

Is this the way you see the usage of probes with new functionality?


{
    "safeguards": [
        {
            "type": "probe",
            "name": "probe",
            "tolerance": "true"
        },
        {
            "type": "probe",
            "name": "background probe",
            "tolerance": "true",
            "background": true
        },
        {
            "type": "probe",
            "name": "background probe with 60 seconds frequency",
            "tolerance": "true",
            "background": true,
            "frequency": 60
        }
    ]
}

Why force a tolerance on those probes? Isn't it better to let the user decide whether this is a regular probe or a probe with a tolerance that can terminate the experiment?

Thanks for the great comment 👍

Your use of the background property is interesting because, in my proposal, I thought we'd run the whole block in a background thread. The reason is that, otherwise, I don't know when to play these safeguards during the experiment flow. They should play concurrently, all of them. wdyt?

The tolerance is there to decide whether the execution should be interrupted. If I only use regular probes, how do they make the decision to interrupt? Ah, you are thinking of a Python probe that calls the exit_gracefully function here, right?

I assume the safeguards element will be located before the steady-state hypothesis (SSH) element.
I think it is better to let users decide, for each probe, whether it runs the regular way or concurrently.
In addition, the method element works (not 100% sure) as I described above, so this adds consistency too.

Another thing my colleague pointed out is that there is no need to put both "background": true and "frequency": 60 in the same probe, because using frequency already means that the probe must run in the background.

The same idea applies to tolerance: if the user declares a tolerance in this element, CTK will handle the termination;
if there is no tolerance, the user can terminate using exit_gracefully or decide to do nothing.
I don't think you need to force the behavior as in SSH probes. Maybe I'm wrong here and there are other things to think about.
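
As a rough illustration of the two rules suggested in this comment (a frequency implies background execution, and only a probe with a tolerance lets the toolkit terminate the run), here is a minimal Python sketch. The function describe_safeguard and the "toolkit_terminates" key are hypothetical; this reflects the commenter's suggestion, not an agreed design.

```python
from typing import Any, Dict


def describe_safeguard(probe: Dict[str, Any]) -> Dict[str, Any]:
    """Derive how a safeguard probe would be applied under the suggested rules."""
    frequency = probe.get("frequency")
    return {
        "name": probe.get("name"),
        # A frequency implies background execution, so "background" does
        # not have to be repeated next to it.
        "background": frequency is not None or bool(probe.get("background", False)),
        "frequency": frequency,
        # Only a probe that declares a tolerance lets the toolkit itself
        # decide to terminate; otherwise the probe's own code decides.
        "toolkit_terminates": "tolerance" in probe,
    }
```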

Another thing that came to my mind is that the termination, whether initiated by CTK or by exit_gracefully, must be aware of the element currently being executed: if you are before the method, you don't need to roll back; otherwise you do.
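
The rollback rule described here could look like the following Python sketch. The phase names and the needs_rollback function are hypothetical and chosen only for illustration; they do not reflect how Chaos Toolkit tracks its progress internally.

```python
# Hypothetical phase names, listed in execution order.
PHASES = ("safeguards", "hypothesis-before", "method", "hypothesis-after")


def needs_rollback(current_phase: str) -> bool:
    """Tell whether an interruption during `current_phase` requires rollbacks.

    Nothing has been changed in the system before the method starts, so
    rollbacks are only needed once the method has begun.
    """
    if current_phase not in PHASES:
        raise ValueError(f"unknown phase: {current_phase!r}")
    return PHASES.index(current_phase) >= PHASES.index("method")
```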

I find we think in terms of implementation rather than spec and I'm not comfortable with the idea. For instance exit_gracefully doesn't mean anything to the spec, it's a chaostoolkit implementation detail.

I feel you guys have a specific use case in mind but I think it may turn the general use case into a more complicated story.

The safeguards block is here to protect your system from the experiment currently running. If you have a complex workflow here, I can appreciate that but I'm finding it really confusing for the general case.

I need to think a bit more about it.

I also like the consistency with the SSH rather than the method. Both the safeguards and the SSH share a similar spirit of "make a decision about this experiment on a regular basis". This means that if the experiment is explicit about the validation for the SSH, it should also be for the safeguards. In fact, you asked for the experiment to be more explicit, yet in the end you talk a lot about hiding things in the Python probes. This is not consistent.

Good point. We will try to think at a higher level.
Regarding tolerance, it makes sense to keep it as strict as in the SSH, to reduce confusion.

Let's jump back a bit. In my world, we have two types of safeguards: those executed only once before the experiment, and those executed during the experiment at some frequency. Some of them are the same and some of them are different.

If the whole safeguards element is executed in parallel in the background and I have a probe that takes 10 seconds to execute, the method element might start before this safeguard probe has had a chance to do its work.
The only option I can think of to prevent this scenario is to run it as a regular blocking probe.
That argues for a background property on the probe itself.

I think I'm completely happy with making sure the safeguards get run at least once before anything else. The repetition is managed only through ctk flags (like the --hypothesis-strategy=continuously flag).

So:

  • default is to run the safeguards once before anything else
  • a flag that tells ctk: run the safeguards all along (with once before everything else of course) at a given frequency

By using tolerances, we then have an experience similar to the SSH.

The only additional flag I want is something that says "interrupt when all the safeguards fail or when any single one fails"
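
The scheduling described above (one blocking pass before anything else, then optional repetition at a given frequency) can be sketched in Python as follows. Everything here is an assumption made for illustration: the guard function, its parameters and the use of threading.Event objects are not the Chaos Toolkit implementation. The check callable stands for "all safeguards evaluated together" and returns True when the experiment may carry on.

```python
import threading
from typing import Callable, Optional


def guard(
    check: Callable[[], bool],
    interrupted: threading.Event,
    stop: threading.Event,
    frequency: Optional[float] = None,
) -> Optional[threading.Thread]:
    """Run the safeguards once, blocking, then optionally in the background.

    `interrupted` is set when the safeguards say "no".
    `stop` is set by the caller once the experiment has finished.
    """
    # Default behavior: a single blocking pass before anything else runs,
    # so a slow probe cannot be overtaken by the method.
    if not check():
        interrupted.set()
        return None

    # Without a frequency, the safeguards are not repeated.
    if frequency is None:
        return None

    def loop() -> None:
        # Event.wait returns False on timeout, so the check is repeated
        # every `frequency` seconds until `stop` is set.
        while not stop.wait(frequency):
            if not check():
                interrupted.set()
                return

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```

The runner would then poll interrupted between activities and set stop once the experiment ends, so the background thread terminates.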

The only scenario not covered here is running one safeguard once but another one many times.

The last problem is not the least one :-)
Any ideas for resolving it, other than those I proposed?

We could decide that for safeguards, probes have extra properties to tell ctk how they should be applied. I'm not sure how this would look yet.

You don't like a "frequency": 60 property on the probe?

Let's try it 👍 I will try to work on a first implementation tomorrow or Friday :)

Started working on an implementation. This is not trivial with all the corner cases you exposed. It will take some time to complete next week.

According to @Lawouach this is delivered, closing the issue