Decouple individual pod termination frequency from cluster size
linki opened this issue
Currently, the probability of a pod being killed depends on the number of pods in the target group. This is a problem if you want to run chaoskube as a cluster addon and let pods opt in to being killed via annotations, because you cannot estimate how often a given pod would be killed.
Proposal
Allow specifying how often pods are terminated, or at least keep track of what has happened, so that pod terminations happen in a somewhat predictable way. For example, instead of terminating a single pod every 10 minutes, each pod could have a probability of X% of being killed per hour. This would hopefully make how often an individual pod is terminated independent of the total number of pods running.
Would you like this to be a pod-specific or a cluster-wide probability?
I was thinking about making it pod-specific, but I also see value in a global version of it like you propose in #34.
For the pod-specific version I thought one would annotate a PodSpec with something like:
- chaos.alpha.kubernetes.io/frequency=2/day for "kill this twice per day"
- chaos.alpha.kubernetes.io/frequency=10/hour for "kill this ten times per hour"
- chaos.alpha.kubernetes.io/frequency=1/week for "kill this once a week"
- etc.
To implement this, one invokes chaoskube at a certain interval like before and then calculates a probability per pod based on the desired frequency and how often chaoskube is invoked.
For instance, let's assume chaoskube runs at an interval of 1 minute and a pod has the annotation set to twice a day (2/day). Then on each iteration chaoskube would calculate a probability that this pod should be killed like this:
- twice a day => 24*60/2 => every 720 minutes
- since chaoskube runs every minute => 1/720 => 0.14% chance to kill this pod in each iteration.
Or for ten times an hour (10/hour):
- ten times an hour => 60/10 => every 6 minutes
- since chaoskube runs every minute => 1/6 => 16.7% chance to kill this pod in each iteration.
This would also work with different intervals I think.
I'm not sure if this is correct, but if it is, it would allow chaoskube to remain stateless, and pods would be killed at roughly the same pace over time regardless of cluster size.
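As a rough check of the arithmetic above, here is a minimal Python sketch of the per-iteration calculation. The names parse_frequency and kill_probability, and the set of accepted units, are assumptions made for illustration and are not part of chaoskube.

```python
# Hypothetical helpers for illustration only; not chaoskube's actual API.
SECONDS_PER_UNIT = {"minute": 60, "hour": 3600, "day": 86400, "week": 604800}


def parse_frequency(value: str) -> float:
    """Turn an annotation value such as "2/day" into kills per second."""
    count, unit = value.strip().split("/")
    return int(count) / SECONDS_PER_UNIT[unit]


def kill_probability(frequency: str, interval_seconds: float) -> float:
    """Probability of killing the pod in one iteration of the given length."""
    return parse_frequency(frequency) * interval_seconds


if __name__ == "__main__":
    interval = 60  # chaoskube runs every minute, as in the example above
    for freq in ("2/day", "10/hour"):
        p = kill_probability(freq, interval)
        print(f"{freq}: {p:.4%} chance per iteration")
    # Sanity check: 1440 iterations per day times the 2/day probability
    # gives the expected number of kills per day.
    print(1440 * kill_probability("2/day", interval))
```

Multiplying the per-iteration probability by the number of iterations in the period gives back the requested frequency as an expected value, which suggests the stateless approach does hit the target rate on average.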
Sounds good.
Just a couple of thoughts/questions I had when I read this:
- What do you plan to do with non-annotated pods?
- What happens to pods that want a higher kill rate than 60/hour?
- This is more complicated, but might be the better approach because every pod can opt-in and decide its own kill rate.
- Do we want that to be pod-specific or namespace-specific?
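To make the first two questions concrete, here is one possible policy sketched in Python. It is an assumption, not a decision made in this thread: non-annotated pods are treated as opted out, and any rate above one kill per iteration is capped at a probability of 1. The function name and unit table are hypothetical.

```python
from typing import Optional

# Hypothetical unit table for illustration only.
SECONDS = {"hour": 3600, "day": 86400, "week": 604800}


def per_iteration_probability(annotation: Optional[str],
                              interval_seconds: float) -> float:
    """Kill probability for one iteration under the assumed policy."""
    if annotation is None:
        # Assumption: a pod without the annotation has not opted in.
        return 0.0
    count, unit = annotation.split("/")
    expected_kills = int(count) * interval_seconds / SECONDS[unit]
    # A single iteration can kill a pod at most once, so cap at 1.0.
    return min(expected_kills, 1.0)
```

With a 1 minute interval, a pod asking for 120/hour would be capped at a probability of 1, so it would effectively be killed 60 times per hour. Going beyond that would need a shorter interval.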