# PDB not created when Workflow waiting on lock
agilgur5 opened this issue
### Pre-requisites

- I have double-checked my configuration
- I have tested with the `:latest` image tag (i.e. `quay.io/argoproj/workflow-controller:latest`) and can confirm the issue still exists on `:latest`. If not, I have explained why, in detail, in my description below.
- I have searched existing issues and could not find a match for this bug
- I'd like to contribute the fix myself (see contributing guide)
### What happened/what did you expect to happen?
This is a follow-up to my findings in #6356 (comment) / #10178 (comment) / #12965. This is technically a regression from 3.2.

When using a semaphore, mutex, or `parallelism`, if your Workflow cannot start because it is waiting for a lock, any PDB configured on it will not be created.

The PDB should be created regardless of semaphore or mutex usage.
### Version

v3.5.6, `:latest`
### Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
1. Run this Workflow twice in quick succession (e.g. submit, then immediately resubmit):

   ```yaml
   apiVersion: argoproj.io/v1alpha1
   kind: Workflow
   metadata:
     generateName: synchronization-wf-level-
   spec:
     podDisruptionBudget:
       minAvailable: 1
     synchronization:
       mutex:
         name: workflow
     entrypoint: whalesay
     templates:
       - name: whalesay
         container:
           image: docker/whalesay:latest
           command:
             - sleep
           args:
             - "30"
   ```
2. Check for PDBs:

   ```shell
   $ kubectl get pdb
   NAME                             MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
   synchronization-wf-level-6zqm6   1               N/A               0                     22s
   ```

   Only 1 PDB exists, not 2; the second one never gets created.
### Logs from the workflow controller

```shell
kubectl logs -n argo deploy/workflow-controller | grep ${workflow}
```
### Logs from in your workflow's wait container

```shell
kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
```
Per #6356 (comment) and #10178 (comment), the solution here is unfortunately not entirely straightforward (otherwise I would have submitted a PR directly). A Workflow that has not yet acquired its lock arguably shouldn't be creating resources like a PDB.

Creating the PDB anyway wouldn't be the worst outcome, but it's not quite correct and it also affects latency when using synchronization. So ideally this needs a larger refactor.