Should workflow templates pre-exist prior to running `kubectl create -f workflow.yaml`?
Analect opened this issue.
@terrytangyuan .. I've been working my way through your book. Thanks for putting together such an informative book. The latest version available on Manning is v7 from May. Perhaps there is something more up to date ... that might explain why I'm hitting the issue described below?
From chapter 8, various CRDs are created with `kubectl kustomize manifests | k apply -f -`. It might be worth using the long form, `kubectl kustomize manifests | kubectl apply -f -`, for those who may not have an alias `k` set up for `kubectl`.
I notice that this calls the `distributed-ml-patterns/code/project/manifests/kustomization.yaml` file, which in turn activates various manifests in the `argo-workflows` and `kubeflow-training` folders.
```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: kubeflow
resources:
- argo-workflows/
- kubeflow-training/
```
It seems that when I try to run `kubectl create -f workflow.yaml` in Chapter 9, it fails (see below). I think it might be due to the absence of the correct workflow templates pre-populated in Argo. Could it be that manifests from the `e2e-demo` folder should have been included in the `kustomization.yaml` above, or is something else missing?
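If those manifests were indeed meant to be applied together, I'd imagine the change would look something like this (purely my guess, not something from the book):

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: kubeflow
resources:
- argo-workflows/
- kubeflow-training/
- e2e-demo/   # hypothetical addition; not in the repo's kustomization.yaml
```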
Appreciate your input. Thanks.
```
~ % kubectl get workflows -n kubeflow
NAME             STATUS   AGE   MESSAGE
tfjob-wf-lzwv9   Failed   60m   invalid spec: templates.tfjob-wf.steps[0].data-ingestion-step template name 'data-ingestion-step' undefined
```
```
~ % kubectl describe workflows tfjob-wf-lzwv9 -n kubeflow
Name:         tfjob-wf-lzwv9
Namespace:    kubeflow
Labels:       workflows.argoproj.io/completed=true
              workflows.argoproj.io/phase=Failed
Annotations:  <none>
API Version:  argoproj.io/v1alpha1
Kind:         Workflow
Metadata:
  Creation Timestamp:  2023-08-21T14:56:06Z
  Generate Name:       tfjob-wf-
  Generation:          2
  Resource Version:    8141
  UID:                 3a573b29-ef27-4940-9c99-5f2c541850ea
Spec:
  Arguments:
  Entrypoint:  tfjob-wf
  Pod GC:
    Strategy:  OnPodSuccess
  Templates:
    Inputs:
    Metadata:
    Name:  tfjob-wf
    Outputs:
    Steps:
      [map[arguments:map[] name:data-ingestion-step template:data-ingestion-step]]
      [map[arguments:map[] name:distributed-tf-training-steps template:distributed-tf-training-steps]]
      [map[arguments:map[] name:model-selection-step template:model-selection-step]]
      [map[arguments:map[] name:create-model-serving-service template:create-model-serving-service]]
  Volumes:
    Name:  model
    Persistent Volume Claim:
      Claim Name:  strategy-volume
    Name:  data-ingestion-step
    Name:  distributed-tf-training-steps
    Name:  cnn-model
    Name:  cnn-model-with-dropout
    Name:  cnn-model-with-batch-norm
    Name:  model-selection-step
    Name:  create-model-serving-service
Status:
  Conditions:
    Status:  True
    Type:    Completed
  Finished At:  2023-08-21T14:56:06Z
  Message:      invalid spec: templates.tfjob-wf.steps[0].data-ingestion-step template name 'data-ingestion-step' undefined
  Phase:        Failed
  Progress:     0/0
  Started At:   2023-08-21T14:56:06Z
Events:
  Type     Reason          Age  From                 Message
  ----     ------          ---  ----                 -------
  Warning  WorkflowFailed  60m  workflow-controller  invalid spec: templates.tfjob-wf.steps[0].data-ingestion-step template name 'data-ingestion-step' undefined
```
> It might be worth using the long form, `kubectl kustomize manifests | kubectl apply -f -`, for those who may not have an alias `k` set up for `kubectl`.
Thanks! That is fixed in the book, and I just fixed it in the README file in the repo as well.
> invalid spec: templates.tfjob-wf.steps[0].data-ingestion-step template name 'data-ingestion-step' undefined
It seems like the template does not exist. Could you run `kubectl apply -f` on this file: https://github.com/terrytangyuan/distributed-ml-patterns/blob/main/code/project/manifests/e2e-demo/workflows-templates-tfjob.yaml?
Actually, can you try the latest version in the main branch? That data ingestion template should already exist in https://github.com/terrytangyuan/distributed-ml-patterns/blob/main/code/project/code/workflow.yaml#L26
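For readers hitting the same error: it means the step `data-ingestion-step` references a template name that is not defined in the Workflow's `templates` list. A minimal sketch of the required shape (names taken from the error above; the container image and command are placeholders, not the book's actual manifest):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: tfjob-wf-
spec:
  entrypoint: tfjob-wf
  templates:
  - name: tfjob-wf
    steps:
    - - name: data-ingestion-step
        template: data-ingestion-step  # must match a template defined below
  - name: data-ingestion-step          # defining this resolves the error
    container:
      image: alpine:3.19               # placeholder
      command: [echo, "ingest data"]   # placeholder
```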
@terrytangyuan .. thanks for the prompt response. I have been using the latest code as of today from this repo.
Yes, those `e2e_demo` steps didn't appear anywhere in the book ... and I had overlooked those commands in that README in the project folder. Let me try to get it working.
I also hadn't focused on the README at distributed-ml-patterns/tree/main/code/project/code, which contains some material that hasn't made it into version 7 of the book, which I had been following. Maybe that has since been addressed.
Chapter 9 is full of great content, but I'm sometimes struggling to get the code snippets from the book working. As you explain how each function works, it seems they need to be run as the full Python script (all functions together), like https://github.com/terrytangyuan/distributed-ml-patterns/blob/main/code/project/code/multi-worker-distributed-training.py.
Getting things working in Chapter 9 requires you to have run certain steps in Chapter 8 of the book. If it's not too late, it might be worth recapping those required steps at the beginning of Chapter 9, since some readers might wish to plunge directly into the end-to-end in Chapter 9.
> Getting things working in Chapter 9 requires you to have run certain steps in Chapter 8 of the book. If it's not too late, it might be worth recapping those required steps at the beginning of Chapter 9, since some readers might wish to plunge directly into the end-to-end in Chapter 9.
That's great feedback. Thank you! If you have a specific recommendation on what prerequisites are missing to follow the code snippets in the last chapter, please let me know.
> Yes, those `e2e_demo` steps didn't appear anywhere in the book ... and I had overlooked those commands in that README in the project folder. Let me try to get it working.
They should not be part of the book. Please follow the `workflow.yaml` in the repo for now.
So I ran these, per the README, and they appeared to run OK (see middle job in screenshot below).
```
kubectl create -f manifests/e2e-demo/workflows-templates-tfjob.yaml
kubectl create -f manifests/e2e-demo/e2e-workflow.yaml
```
However, each time I try to run `kubectl create -f workflow.yaml` using this workflow.yaml file, it fails, per the error at the top of this thread. It seems I am missing the various templates referenced there. It somehow expects these workflow templates to pre-exist, but I can't find them anywhere in the code. Am I missing something?
Thank you! I just fixed it. Could you try again?
That change permitted the workflow to begin running. It seemed like a small tweak. Do you mind explaining what the fix was?
However, I'm seeing that the `multi-worker-training-*` pods get stuck in Pending. If I look into one of them, I see the following event:

```
Warning  FailedScheduling  5m31s  default-scheduler  0/1 nodes are available: 1 persistentvolumeclaim "strategy-volume" not found. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
```
Maybe this is a namespace issue. My persistent volume claim `strategy-volume` does exist ... but in the `default` namespace. I think it was generated from here. Perhaps it needs to be in the `kubeflow` namespace for this to work?
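In case it helps others, recreating the claim in the matching namespace would look roughly like this (the size and access mode are illustrative; the book's actual values may differ):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: strategy-volume
  namespace: kubeflow             # must match the namespace the workflow runs in
spec:
  accessModes: ["ReadWriteOnce"]  # illustrative
  resources:
    requests:
      storage: 1Gi                # illustrative
```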
Did you run the following to change the current namespace?

```
kns kubeflow
```

Once that's run, all your manifests without a namespace specified will use the current namespace.
Is `kns` an alias for something else? I see it at blendle/kns.
I suppose the kubectl-native way would be `kubectl config set-context --current --namespace=kubeflow`. I don't often do that.
Ran this:

```
% kubectl config set-context --current --namespace=kubeflow
Context "k3d-distml" modified.
% kubectl create -f workflow.yaml
workflow.argoproj.io/tfjob-wf-k45fh created
```
Same issue:

```
Warning  FailedScheduling  5m31s  default-scheduler  0/1 nodes are available: 1 persistentvolumeclaim "strategy-volume" not found. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
```
Will have to try to re-create `strategy-volume` in the `kubeflow` namespace tomorrow ... and see if that fixes it.
It should be covered in the previous chapter. Instead of switching the current namespace, you can also add `-n kubeflow` to your kubectl commands to specify the namespace explicitly.
@terrytangyuan ... got things working by recreating the `strategy-volume` in the `kubeflow` namespace.
I noticed here that it should be `"--model_type", "batch_norm"` ... rather than `dropout`, which is repeated for models 2 and 3.
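In other words, the third model variant's args presumably should end up like this (surrounding fields abbreviated; the image and command are placeholders — only the `--model_type` flag and template name are from the repo):

```yaml
- name: cnn-model-with-batch-norm
  container:
    image: multi-worker-strategy:v0.1     # placeholder
    command: ["python", "training.py"]    # placeholder
    args: ["--model_type", "batch_norm"]  # was mistakenly "dropout"
```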
I'll close this out for now and maybe raise a separate issue for other feedback on Chapter 9.
Great, thanks!