powerfulseal/powerfulseal

A powerful testing tool for Kubernetes clusters.

checkPodCount ends prematurely when 0 pods remain after pod killing

paigerube14 opened this issue

After killing the only running pod in a given namespace, checkPodCount ends incorrectly before a replacement pod comes back up and is running again.
I would expect checkPodCount to keep checking for the entire duration of the timeout before passing or failing.

Sample scenario YAML:

config:
  runStrategy:
    runs: 1
    maxSecondsBetweenRuns: 30
    minSecondsBetweenRuns: 1
scenarios:
  - name: "delete etcd pods"
    steps:
    - podAction:
        matches:
          - labels:
              namespace: "etcd"
              selector: "k8s-app=etcd"
        filters:
          - randomSample:
              size: 1
        actions:
          - kill:
              probability: 1
              force: true
    - podAction:
        matches:
          - labels:
              namespace: "etcd"
              selector: "k8s-app=etcd"
        retries:
          retriesTimeout:
            timeout: 180
        actions:
          - checkPodCount:
              count: 1

Output:

2021-06-25 19:16:09 INFO __main__ No cloud driver - some functionality disabled
2021-06-25 19:16:09 INFO __main__ Using stdout metrics collector
2021-06-25 19:16:09 INFO __main__ NOT starting the UI server
2021-06-25 19:16:09 INFO __main__ STARTING AUTONOMOUS MODE
2021-06-25 19:16:12 INFO scenario.delete etcd pod Starting scenario 'delete etcd pods' (2 steps)
2021-06-25 19:16:12 INFO action_nodes_pods.delete etcd pod Matching 'labels' {'labels': {'namespace': 'etcd', 'selector': 'k8s-app=etcd'}}
2021-06-25 19:16:12 INFO action_nodes_pods.delete etcd pod Matched 1 pods for selector k8s-app=etcd in namespace etcd
2021-06-25 19:16:12 INFO action_nodes_pods.delete etcd pod Initial set length: 1
2021-06-25 19:16:12 INFO action_nodes_pods.delete etcd pod Filtered set length: 1
2021-06-25 19:16:12 INFO action_nodes_pods.delete etcd pod Pod killed: [pod #0 name=etcd-master-00.qe-pr-sno2.qe.devcluster.openshift.com namespace=etcd containers=4state=Running labels:app=etcd,etcd=true,k8s-app=etcd,revision=2 annotations:kubernetes.io/config.hash=*,kubernetes.io/config.seen=2021-06-25T14:30:12.819685290Z,kubernetes.io/config.source=file,target.workload.openshift.io/management={"effect": "PreferredDuringScheduling"}]
2021-06-25 19:16:12 INFO action_nodes_pods.delete etcd pod Matching 'labels' {'labels': {'namespace': 'etcd', 'selector': 'k8s-app=etcd'}}
2021-06-25 19:16:12 INFO action_nodes_pods.delete etcd pod Matched 0 pods for selector k8s-app=etcd in namespace etcd
2021-06-25 19:16:12 INFO action_nodes_pods.delete etcd pod Initial set length: 0
2021-06-25 19:16:12 INFO scenario.delete etcd pod Scenario finished
2021-06-25 19:16:12 INFO policy_runner All done here!

@seeker89 PTAL when you get time. Thanks.

Per the documentation, retries specifies "An object of retry criteria to rerun set actions". Since the actions are only performed on matched pods that passed the filter criteria, and there were zero such pods at the moment matches was evaluated, the actions were never run.
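
Annotating the failing step with how I read the semantics (my interpretation, based on the docs quote above and your log output, so treat it as a sketch rather than gospel):

    - podAction:
        matches:            # evaluated up front; returned 0 pods in your log
          - labels:
              namespace: "etcd"
              selector: "k8s-app=etcd"
        retries:            # per the docs, only reruns the actions below, and
          retriesTimeout:   # actions run against the matched pods, so with an
            timeout: 180    # empty set there is nothing to rerun
        actions:
          - checkPodCount:
              count: 1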

I'd suggest inserting a waitAction prior to the second podAction.
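
For example, something like this between the two podAction steps (I'm going from memory on the exact schema, so the step spelling and field may differ, and 60 seconds is an arbitrary guess):

    - wait:
        seconds: 60   # give the replacement pod time to be rescheduled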

In this case the waitAction is not super helpful, because I would have to guess when the pod comes back, which is the whole point of the retries in the podAction. The retries in the podAction should be used to verify the number of pods that exist: if 0 pods exist at the current time, it should keep checking until the time limit or retry count is reached before failing.