linki / chaoskube

chaoskube periodically kills random pods in your Kubernetes cluster.

Decouple individual pod termination frequency from cluster size

linki opened this issue · comments

Currently, the probability of a pod being killed depends on the number of pods in the target group. This is a problem if you want to run chaoskube as a cluster addon and have pods opt in to being killed via annotations, because you cannot estimate how often any given pod would be killed.

Proposal

Allow specifying a termination rate, or at least keep track of terminations, so that pod terminations happen in a somewhat predictable way. For example, instead of terminating a single pod every 10 minutes, each pod may have a probability of X% of being killed per hour. This, hopefully, would make the termination rate of each pod independent of the total number of pods running.

Would you like this to be a pod-specific or a cluster-wide probability?

I was thinking about making it pod-specific, but I also see value in a global version of it like you propose in #34.

For the pod-specific version, I thought one would annotate a PodSpec with something like:

  • chaos.alpha.kubernetes.io/frequency=2/day for "kill this twice per day"
  • chaos.alpha.kubernetes.io/frequency=10/hour for "kill this ten times per hour"
  • chaos.alpha.kubernetes.io/frequency=1/week for "kill this once a week"
  • etc.

To implement this, one invokes chaoskube at a certain interval like before and then calculates a probability per pod based on the desired frequency and how often chaoskube is invoked.

For instance, let's assume chaoskube runs at an interval of 1 minute and a pod has the annotation set to twice a day (2/day). Then on each iteration chaoskube would calculate a probability that this pod should be killed like this:

  • twice a day => 24*60/2 => once every 720 minutes
  • since chaoskube runs every minute => 1/720 => 0.14% chance to kill this pod in each iteration.

Or, for ten times an hour (10/hour):

  • ten times an hour => 60/10 => once every 6 minutes
  • since chaoskube runs every minute => 1/6 => 16.7% chance to kill this pod in each iteration.

This would also work with different intervals, I think.
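
To spell out the generalization to other intervals, here is a sketch of the formula behind the two calculations. The symbols are introduced here for illustration and are not part of the proposal: n is the desired number of kills per period, T is the length of that period, and Δ is the interval at which chaoskube runs, with T and Δ in the same unit.

```latex
p = \frac{n \cdot \Delta}{T},
\qquad
\mathbb{E}[\text{kills per period}] = \frac{T}{\Delta} \cdot p = n
```

For 2/day with a 1-minute interval this gives p = 2 · 1 / 1440 = 1/720, as in the first calculation. With a 10-minute interval it would give p = 2 · 10 / 1440 = 1/72, roughly 1.39% per iteration, and the expected number of kills per day stays at 2. The formula only yields a valid probability while n · Δ ≤ T, so with a 1-minute interval the highest frequency it can express is 60/hour.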

I'm not sure this is correct, but if it is, it would allow chaoskube to remain stateless, and pods would be killed at roughly the same pace over time regardless of cluster size.

Sounds good.

Just a couple of thoughts/questions I had when I read this:

  • What do you plan to do with non-annotated pods?
  • What happens to pods that want a higher kill rate than 60/hour?
  • This is more complicated, but it might be the better approach because every pod can opt in and decide its own kill rate.
  • Do we want this to be pod-specific or namespace-specific?