# PDB not created when Workflow waiting on lock
agilgur5 opened this issue
### Pre-requisites

- I have double-checked my configuration
- I have tested with the `:latest` image tag (i.e. `quay.io/argoproj/workflow-controller:latest`) and can confirm the issue still exists on `:latest`. If not, I have explained why, in detail, in my description below.
- I have searched existing issues and could not find a match for this bug
- I'd like to contribute the fix myself (see contributing guide)
### What happened/what did you expect to happen?
This is a follow-up to my findings in #6356 (comment) / #10178 (comment) / #12965. This is technically a regression from 3.2.

When using a semaphore, mutex, or `parallelism`, if your Workflow cannot start because it is waiting for a lock, any PDB configured on it will not be created.

The PDB should be created regardless of semaphore or mutex usage.
### Version

v3.5.6, `:latest`
### Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
1. Run this Workflow twice in quick succession (e.g. submit, then immediately resubmit):

   ```yaml
   apiVersion: argoproj.io/v1alpha1
   kind: Workflow
   metadata:
     generateName: synchronization-wf-level-
   spec:
     podDisruptionBudget:
       minAvailable: 1
     synchronization:
       mutex:
         name: workflow
     entrypoint: whalesay
     templates:
       - name: whalesay
         container:
           image: docker/whalesay:latest
           command:
             - sleep
           args:
             - "30"
   ```
2. Check for PDBs:

   ```shell
   $ kubectl get pdb
   NAME                             MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
   synchronization-wf-level-6zqm6   1               N/A               0                     22s
   ```

   Only 1 PDB exists, not 2; the second one never gets created.
### Logs from the workflow controller

```shell
kubectl logs -n argo deploy/workflow-controller | grep ${workflow}
```
### Logs from in your workflow's wait container

```shell
kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
```
Per #6356 (comment) and #10178 (comment), the solution here is unfortunately not entirely straightforward (otherwise I would have submitted a PR directly). A Workflow that has not yet acquired its lock arguably shouldn't be creating resources like a PDB.

Creating the PDB anyway wouldn't be the worst outcome, but it's not quite correct and it also affects latency when using synchronization. So ideally this needs a larger refactor.