Project SIMMY: Support for failure / chaos injection policies to test resilience of the system

Question

Project SIMMY: Support for failure / chaos injection policies to test resilience of the system

mebjas opened this issue 6 years ago · comments

Having this library helps me introduce resilience to my project, but I don't want expected or unexpected failures to occur to test it out

The library could be wrongly implemented, testing the scenarios is not straight forward
Mocking failure of a dependency for example a cloud SaaS / PaaS service is not always provided straight away.

Describe your proposed or preferred solution:

A way to mock failures of dependencies (any service dependency for example).
This should be strict framework that takes care of when to fail based on some external factors - maybe global configuration or some rule.
Way to revert easily controlling the blast radius
Production grade to run this in production system with automation.

Describe any alternative options you've considered:
Doing it myself, manually disabling the service which supports that and checking if it really breaks the circuit and moves to fallback.

Any additional info?
We can definitely take learning from Netflix Simian Army.

Dylan Reisenberger · Answer 1 · Mon Sep 17 2018 05:09:02 GMT+0800 (China Standard Time)

Interesting question and ideas @mebjas - I had similar ideas in the past (as discussed with @joelhulen / @jhalterman / others) that the Polly architecture could be used for fault-injection.

Obviously there are established solutions in the market for mocking third-party endpoints, like Mountebank or Wiremock; and likewise many tools for disrupting traffic/terminating instances etc at the broader container or cloud infrastructure layer. But fault-injection within Polly could perhaps complement and serve different use cases:

Injecting faults via Polly by definition would happen in-process. This means no external infrastructure/tool/http-interception needed.
For the dev/test phase, easy to mock failure of a dependency to test the envisaged Polly strategy or other response.
Not all services are amenable to http mocking through tools such as Mountebank or Wiremock, eg if you are calling some Azure/AWS service through the AWS or Azure SDK.
Simulating failure of a dependency in process does allow you to simulate that failure for a specific consumer/group-of-consumers without actually killing the real dependency or necessarily killing it for all consumers.

It would be relatively easy to implement in Polly - ideas coming in a follow-up post.

Dylan Reisenberger · Answer 2 · Mon Sep 17 2018 05:14:05 GMT+0800 (China Standard Time)

Fault injection proposal for Polly

For non-generic Polly policies:

var chaosPolicy = Policy.InjectFault(
    Exception | Func<Context, Exception> fault,  // The fault to inject
    decimal | Func<Context, decimal> injectionRate, // A decimal between 0 and 1 inclusive.  The policy will inject the fault, randomly, that proportion of the time, eg 0.01 means inject the fault randomly 1% of the time.
    Func<Context, bool> enabled // Faults are only injected when returns true.
);

By decimal | Func<decimal>, I mean an overload offering this parameter simply as decimal, and another overload offering it as Func<decimal>.

For generic `Policy<TResult>` policies governing executions returning `TResult`:

Same as above, but in addition to the Exception options, there would also be overloads offering the following three options for the fault parameter:

TResult fault
Func<Context, TResult> fault
Func<Context, DelegateResult<TResult>> fault
- This last one gives the ability to build an InjectFault policy which sometimes injects an exception, sometimes a return result (eg http calls may fail with exception or an http status code).

Operation

If enabled(...) returns true, and if a random number between 0 and 1 is less than injectionRate, return or throw the fault instead of executing the delegate passed to the policy for execution.
Otherwise execute the passed delegate as normal.

Usage

Typical usage would probably be to configure a InjectFault policy as the innermost part of a PolicyWrap (so it is closest to where the passed delegate would normally execute):

Policy.WrapAsync(retry, breaker, timeout, faultInject)

When InjectFault didn't inject a fault but merely acted as a pass-through, its performance should be comparable to NoOpPolicy: this has been benchmarked as adding approx 1microsecond (1 millionth of a second) to a call.

It would also be possible to use dynamic configuration during running to include or exclude the InjectFaultPolicy altogether, if desired.

The Func<Context, ...> signatures of all parameters would allow fine-grained control of instances or calls participating in the simulated faults, and indeed of the simulated faults themselves. Polly.Context can be set up with any custom input data; this could be eg drawn from configuration or environment variables, to control when, whether and how instances participate in the simulation.

Thoughts?

What do people think? How can we improve on this proposal?

I am going to mark this 'up-for-grabs' as it is very simple to implement following existing Polly patterns (great way for others to get involved in the proj), and I want to focus my Polly time on features such as Distributed Breaker which depend more on existing deep Polly knowledge. This would be an awesome feature for a new contributor to get public acknowledgment!

The parts to implement a policy are described here. A simple way to implement this would be to copy-and-rename all the parts for NoOpPolicy and then start adding the functionality described above.

Dylan Reisenberger · Answer 3 · Mon Sep 17 2018 05:20:19 GMT+0800 (China Standard Time)

Latency injection proposal

Another obvious failure injection would be latency:

var latencyPolicy = Policy.InjectLatency(
    TimeSpan | Func<Context, Timespan> latency, // The latency to inject.
    decimal | Func<Context, decimal> injectionRate, // A decimal between 0 and 1 (inclusive).  The policy will inject the latency, randomly, that proportion of the time, eg 0.01 means inject latency 1% of the time.
    Func<Context, bool> enabled // Latency is only injected when returns true.
);

Operation

If enabled returns true, and if a random number between 0 and 1 is less than injectionRate, pause for the specified latency.
Then execute the passed delegate as normal.

Questions

Should the policy pause, then execute the passed delegate? Or should it execute the passed delegate, obtain the result, then pause before returning the result or failure? Are there any use cases where it makes any difference?

minhaz · Answer 4 · Mon Sep 17 2018 12:09:24 GMT+0800 (China Standard Time)

@reisenberger Thanks for response! I think I can take this up if that's how it works.

minhaz · Answer 5 · Mon Sep 17 2018 15:45:11 GMT+0800 (China Standard Time)

Latency injection proposal

Another obvious failure injection would be latency:
var latencyPolicy = Policy.InjectLatency(
    TimeSpan | Func<Context, Timespan> latency, // The latency to inject.
    decimal | Func<Context, decimal> injectionRate, // A decimal between 0 and 1 (inclusive).  The policy will inject the latency, randomly, that proportion of the time, eg 0.01 means inject latency 1% of the time.
    Func<Context, bool> enabled // Latency is only injected when returns true.
);
Operation

If enabled returns true, and if a random number between 0 and 1 is less than injectionRate, pause for the specified latency.

Then execute the passed delegate as normal.

Isn't this just another type of fault. If the fault could be passed as an argument we need not write individual classes for each then. Latency injection can be shared as an example.

Questions

Should the policy pause, then execute the passed delegate? Or should it execute the passed delegate, obtain the result, then pause before returning the result or failure? Are there any use cases where it makes any difference?

Both the cases might be needed. While, before the operation makes more sense to me and use case in my mind. If I were to wrap it around the DB call for example - I might be looking at a timeout or so with operation failure. While if the latency is injected post execution that would mean - actual operation succeeded while the upper layer timed out thus forcing me to handle more scenarios. Which ain't bad either 🤕

Geovanny Alzate Sandoval · Answer 6 · Tue Sep 18 2018 00:41:14 GMT+0800 (China Standard Time)

Hey guys, I would like to know if someone is working on this, if don't I would like to help on this, then you can assign to me this task. 😃

Dylan Reisenberger · Answer 7 · Tue Sep 18 2018 04:03:01 GMT+0800 (China Standard Time)

@vany0114 @mebjas I think you have both volunteered to work on this. Is there the option to maybe take one policy (of InjectFault; InjectLatency) each, and review each other's contribution? (I will review too). Or find some other way to collaborate? (Open to suggestions...). Thank you both for your enthusiasm!

Geovanny Alzate Sandoval · Answer 8 · Tue Sep 18 2018 04:13:55 GMT+0800 (China Standard Time)

Of course, I like that idea, so let @mebjas decide which policy wants to work on since he's the owner of the issue, then I could work on the other issue!

minhaz · Answer 9 · Tue Sep 18 2018 04:43:39 GMT+0800 (China Standard Time)

@reisenberger @vany0114 If they were supposed to be different policies I could go with InjectFault. As of now I just started pushing to this branch - https://github.com/mebjas/Polly/tree/dev-mebjas/src/Polly.Shared/InjectFault

I have a doubt though, how is InjectFault different from InjectLatency?
If the definition looked like following:

var chaosPolicy = Policy.InjectFault(
    Exception | Func<Context, Exception> fault | Action<Context> fault,  // The fault to inject
    decimal | Func<Context, decimal> injectionRate, // A decimal between 0 and 1 inclusive.  The policy will inject the fault, randomly, that proportion of the time, eg 0.01 means inject the fault randomly 1% of the time.
    Func<Context, bool> enabled // Faults are only injected when returns true.
);

I could use it as following for example to inject latency.

Policy.InjectFault(
    (ctx) => { Thread.Sleep(5000); },
    0.2,
    (ctx) => { return true; }
);

In this way we could define different kinds of faults within the delegate fault itself.

Dylan Reisenberger · Answer 10 · Tue Sep 18 2018 04:54:51 GMT+0800 (China Standard Time)

@mebjas Very interesting comment above - I was typing my generalisation/abstraction idea simultaneously. I will still post it (below) and we can all consider further.

@mebjas This is a great question! :

Isn't this [latency] just another type of fault. If the fault could be passed as an argument we need not write individual classes for each then.

We could consider two separate questions:

(a) is the concept of Policy.InjectLatency(...) a primary-enough concept in chaos-engineering, that it is worth providing this direct syntax for?
(b) is there enough commonality of functionality, that we can use a common class (or abstract base class) for the features discussed in this thread?

Re (a), my thought was that latency-injection is a primary-enough concept that it is worth providing some simple/obvious public syntax, even if we (b) generalise the implementation internally.

Re (b), I had the thought yesterday that we could generalise both cases by abstracting out the general injection concept, something like:

    Policy.InjectBehaviour( // Or: InjectCustom // Or just: Inject
        Func<Context, CancellationToken, (DelegateResult<TResult>, bool)> preExecute, 
            // The DelegateResult<TResult> output parameter of preExecute is an _optional_ substitute result (which would replace executing the passed delegate). 
            // The bool output parameter of preExecute is whether to use this replacement result, or call as normal the delegate which was passed to .Execute/Async(...).
        Func<Context, CancellationToken, DelegateResult<TResult>, DelegateResult<TResult>> postExecute, 
            // The input DelegateResult<TResult> parameter is the result from executing the delegate passed for execution by .Execute/Async(...). 
            // The output DelegateResult<TResult> parameter ... allows postExecute to augment, extend, or even replace the original obtained result.
        Func<Context, bool> isOperative
           // ... encapsulates any logic (however complex) 
    );

Maybe it's easier to communicate that the synchronous implementation (BehaviourInjectionPolicyEngine) would be something like:

internal static TResult Implementation<TResult>(Func<Context, CancellationToken, TResult> action, Context context, CancellationToken cancellationToken, Func<Context, CancellationToken, (DelegateResult<TResult>, bool)> preExecute, Func<Context, CancellationToken, DelegateResult<TResult>, DelegateResult<TResult>> postExecute,  Func<Context, bool> isOperative)
{
    if (!isOperative(context)) return action(context, cancellationToken);

    (DelegateResult<TResult> result, bool substitute) preExecuteResult = preExecute(context, cancellationToken);

    if (preExecuteResult.substitute) return preExecuteResult.result;

    DelegateResult<TResult> normalExecutionResult = action(context, cancellationToken);

    return postExecute(context, cancellationToken, normalExecutionResult);
}

This builds a generalised pre-execution-injection and post-execution-injection architecture, which users could repurpose for any means they want (if we exposed it publicly, not just as a hidden abstract base-class). Maybe they want to use postExecute to augment results returned from a particular call. Maybe they want some custom logging in postExecute without having to use FallbackPolicy as a logging workaround.

I don't have an immediate preference - first, just sharing ideas, and we can all reflect.

Dylan Reisenberger · Answer 11 · Tue Sep 18 2018 05:10:15 GMT+0800 (China Standard Time)

( Apologies: poss some mistakes in syntax around whether some of the return values are DelegateResult<TResult> or just TResult in the example above, but it's to illustrate the concept. I can correct tomorrow. )

minhaz · Answer 12 · Tue Sep 18 2018 05:34:42 GMT+0800 (China Standard Time)

@reisenberger

(a) is the concept of Policy.InjectLatency(...) a primary-enough concept in chaos-engineering, that it is worth providing this direct syntax for?
(b) is there enough commonality of functionality, that we can use a common class (or abstract base class) for the features discussed in this thread?

Re (a), my thought was that latency-injection is a primary-enough concept that it is worth providing some simple/obvious public syntax, even if we (b) generalise the implementation internally.

Thinking of it like this, yes it makes sense. Policy.InjectLatency(...) could be an example implementation of Policy.InjectFault(...) . Thus making both options available as first class citizens within Polly. Or they could both share common execution.

Re (b), I had the thought yesterday that we could generalise both cases by abstracting out the general injection concept, something like:

While this seem to be best generalization which cover both of the above and give users ability to inject any behavior before or after the real delegate execution, is this the real use case of chaos engineering here to inject any piece of code? Would it not lead to wrong usage? Exposing Policy.InjectLatency(...), Policy.InjectFault(...) would ensure focused usage.

Has this kind of generalization been used for any other concepts implemented in Polly?

Geovanny Alzate Sandoval · Answer 13 · Tue Sep 18 2018 05:34:59 GMT+0800 (China Standard Time)

@reisenberger I like the first option:

latency-injection is a primary-enough concept that it is worth providing some simple/obvious public syntax, even if we (b) generalize the implementation internally.

I think the generalization is a great idea, but I also think an API should be descriptive enough, having said that, the latency should be considered as an exception itself? I don't think so. I think we should separate both concerns even if we gonna call/use the same base class or something internally, for both InjectFault and InjectLatency

minhaz · Answer 14 · Thu Sep 20 2018 00:21:30 GMT+0800 (China Standard Time)

@reisenberger @vany0114 Added code for InjectFault policy here. Added support for

Action<Context> fault

in InjectFaultEngine.cs which can be used to build InjectLatency policy.

@reisenberger
What all need to be done before sending a PR here. I see unit tests need to be written. Is Readme and wiki changes also needed?
Also, please have a look and tell if I have set of on wrong course code wise.

Geovanny Alzate Sandoval · Answer 15 · Thu Sep 20 2018 00:39:37 GMT+0800 (China Standard Time)

thanks a lot, @mebjas!, so, @reisenberger, should I wait for mebjas code will be merged in order to start working on InjectLatency and make my PR?

Dylan Reisenberger · Answer 16 · Thu Sep 20 2018 01:00:23 GMT+0800 (China Standard Time)

Thanks @mebjas @vany0114 both. I will review the code pushed so far and then advise further!

Ruben Bartelink · Answer 17 · Thu Sep 20 2018 09:47:30 GMT+0800 (China Standard Time)

As I read this, I the name Policy.Monkey is jumping out as the name for the high level InjectXxx driver API - alongside that, obviously nodding to and/or sharing names and doc with Simian Army equivalents should be a primary part of the design and naming in order to make it all just make sense without having to read or write docs.

minhaz · Answer 18 · Thu Sep 20 2018 13:01:42 GMT+0800 (China Standard Time)

As I read this, I the name Policy.Monkey is jumping out as the name for the high level InjectXxx driver API - alongside that, obviously nodding to and/or sharing names and doc with Simian Army equivalents should be a primary part of the design and naming in order to make it all just make sense without having to read or write docs.

Make sense to me. Just correct me if I am wrong - Your suggestion is to have Policy.Monkey to be the base and we write Policy.InjectXXX on top of it right?

Ruben Bartelink · Answer 19 · Thu Sep 20 2018 14:27:15 GMT+0800 (China Standard Time)

Policy.InjectBehaviour( // Or: InjectCustom // Or just: Inject

Sorry should have been clearer; I feel the top level function and namespace can be such that you write Policy.Handle<X>.Monkey(...) or maybe .Chaos - you're monkeying with the call. Also namespace Polly.Monkey can house specific implementation artifacts.

Func<Context, bool> isOperative

Also, for any chaos method, I'd say the rate parameter and/or decision-to-apply-this-time function should go near the front of the list of args (assuming this fits with other APIs - I don't know the feel of the DSL well). I have given little thought, but assume you'd have a rate and an optional shouldApply which defaults to ()=>true at the desired rate etc.

(this is thinking aloud - I've not looked at a chaos suite or DSL for good ways to do it)

Perhaps Monkey Can control rate and shouldApply and then invoke a Func<context....>, which then gives you access to a mini-DSL which lets you do specific behaviors ?

Policy.Monkey( .1, new [] { Polly.Monkey.Latency(() => <timespan>), Polly.Monkey.Mangle(result => update(result)) } )

or

var latency = Polly.Monkey.Latency(<specify range in various ways>);
var mangle = Polly.Monkey.Mangle<TResult>(<specify handler in various ways>);
var chaos =
    Policy.Monkey( ctx => !ctx.ContainsKey["DisableChaos"], factor =>
    {
       if (factor < .95) return null; /// to signal no chaos this time
       else if (factor < .99) latency
       else mangle
   })....   
// alternately
var chaos2 =
    Policy.Monkey( (ctx, factor) =>
    {
       if (factor < .95 || !ctx.ContainsKey["DisableChaos"]) return null; /// to signal no chaos this time
       else if (factor < .99) return latency;
       else return mangle;
   })....   
} )

Also maybe a .Monkeys overload can return an enumerable of IMonkeys to apply for people who want to admit the potential of >1

The other thing that's important (to me at least!), is to have a way in which to examine the monkey(s) selected for application (and perhaps even post-application) in order that I can log the ones applied in some meaningful way

This should also mesh well with being able to neatly plug in and/or test monkeys you add (or provide custom ones)

declaration: the above is coIoured by me being in the middle of thinking about (and implementing) a call policy wrapping lib (which I intend to open source after bedding into our systems) which uses Polly at its heart. The body of my .Monkey selector method would be driven off a table in a config DSL external to the app. In a yamly type syntax, the policy might define sets of monkeys (tribes?!) which can be applied to actions:-

tribe general:
   rate: .10, type: addLatency(min:5ms, max: 2s, p95: .2s)
   rate: .01, type: badJson,
   rate: .01, type: noBody
   rate: .01, type: timeout
   rate: .01, type: isolate
   rate: .01, type: brokenCircuit
tribe latencyOnly:
   rate: .10, type: addLatency(min:5ms, max: 2s, p95: .2s)
   rate: .01, type: timeout

actions:
  viewCart
    monkeys: general
    logging: whenDebug
  addToCart
    monkeys: general 
  placeOrder
    monkeys: latencyOnly
    logging: max

I'd parse those into a Discriminated Union which holds the values and would then be looking to map each one to constructing an IMonkey, and use the logging to control how much to do.

Also for the Latency one - the latency should probably be configurable to include or exclude the time the actual underlying call took (or whether the circuit is isolated or broken etc).

Now, before I go off the deep end, can someone please provide links to some sensible prior art for this triangular wheel I'm reinventing ;)

Dylan Reisenberger · Answer 20 · Thu Sep 20 2018 15:56:05 GMT+0800 (China Standard Time)

Liking a lot of the thinking in / behind this, @bartelink ! (Some things we mby have slightly different style to do in Polly, but I can cover that off, it's awesome just to hear all the ideas/intent so that we can aim to meet them and/or doco how to achieve.) Longer response to follow ...

Thanks for all the awesome input/contribs on this thread folks - keep it coming!

Dylan Reisenberger · Answer 21 · Tue Sep 25 2018 15:43:13 GMT+0800 (China Standard Time)

@mebjas @vany0114 @bartelink Thanks again for all the great input/contributions! I am on the road rest of week but will aim to look at this one of the evenings if events allow.

minhaz · Answer 22 · Thu Sep 27 2018 02:50:46 GMT+0800 (China Standard Time)

@bartelink Thanks for describing it in such details. As @reisenberger said there are certain ways to how things are implemented in Polly, adhering to that we can implement this in layers. Leaving the method names aside it seem every one agree the fault injection policy shall be governed by the following parameters itself

fault - Exception | Func <Context, Exception> | Action<Context> ...
rate - Decimal | Func <Context, Decimal>
isOperative (or enabled) - Func <Context, Decimal>

@reisenberger Is it ok if I send a PR to the dev branch after completing everything required by the code of conduct here and have further discussions over there?

Ruben Bartelink · Answer 23 · Sat Sep 29 2018 09:14:36 GMT+0800 (China Standard Time)

@mebjas If you do post a dev branch WIP PR, I'd be interested in seeing the API you're working on taking shape - I can't promise I'll have any deep insight, but am happy to provide any feedback I can (FYI #496 contains links to a sneak peak of the library I hinted at. I've yet to commence work on fleshing an example format whereby one'd declare chaos rules alongside normal processing rules per Action (that aspect of the code is very far from polished atm, I'm hoping to address that some time next week)).

minhaz · Answer 24 · Tue Oct 02 2018 17:11:55 GMT+0800 (China Standard Time)

@reisenberger @bartelink - check this out, I haven't written the tests yet but structured the code like this.
PR - #508

Dylan Reisenberger · Answer 25 · Tue Oct 02 2018 17:28:12 GMT+0800 (China Standard Time)

Thanks @mebjas ! Now looking at this.

Dylan Reisenberger · Answer 26 · Wed Oct 03 2018 05:40:41 GMT+0800 (China Standard Time)

Thanks again @mebjas for #508 ! (comments posted there).

@mebjas @bartelink @vany0114 / anyone: suggest we let @mebjas (and @vany0114 ?) get this to fully functionally tested, then return to any more philosophical questions about scope, naming, abstract base classes etc from this thread ... (tho all detailed comment/review/sugg re WIP on #508 welcome in the meantime!). Thanks.

Geovanny Alzate Sandoval · Answer 27 · Wed Oct 03 2018 05:50:05 GMT+0800 (China Standard Time)

sure, let me take a look at @mebjas PR. I will happy to help with this 😃

Dylan Reisenberger · Answer 28 · Wed Oct 03 2018 16:56:57 GMT+0800 (China Standard Time)

@vany0114 Cool. Probably worth letting @mebjas finish his part before beginning yours.

minhaz · Answer 29 · Sun Oct 14 2018 22:46:01 GMT+0800 (China Standard Time)

@reisenberger @vany0114

I have covered 3/4 checks. Including the code and unit tests. Please have a look and let me know if anything else is needed.

Dylan Reisenberger · Answer 30 · Mon Oct 15 2018 02:28:02 GMT+0800 (China Standard Time)

Thank you @mebjas ! I will review as soon as possible.

Dylan Reisenberger · Answer 31 · Sun Dec 02 2018 18:36:11 GMT+0800 (China Standard Time)

Cross-posting to catch anybody who has contributed to / is following / later reads this thread: we propose to promote the chaos-engineering dimension to its own package.

Dylan Reisenberger · Answer 32 · Mon Jan 21 2019 02:07:04 GMT+0800 (China Standard Time)

Anyone interested in the chaos-engineering functionality (/cc @martincostello / @bartelink / anyone!), please see the new Simmy repo where the functionality is being finalised.

Comment welcome on any of the Issues there! A few issues represent remaining minor API enhancements I would like to get in place before a v0.1 release.

Some issues shape the API and there is some particular request for community feedback.

Project SIMMY: Support for failure / chaos injection policies to test resilience of the system

Fault injection proposal for Polly

For non-generic Polly policies:

For generic Policy<TResult> policies governing executions returning TResult:

Operation

Usage

Thoughts?

Latency injection proposal

Operation

Questions

Latency injection proposal

Operation

Questions

For generic `Policy<TResult>` policies governing executions returning `TResult`: