terrytangyuan / distributed-ml-patterns

Distributed Machine Learning Patterns from Manning Publications by Yuan Tang https://bit.ly/2RKv8Zo


Should workflow templates pre-exist prior to running `kubectl create -f workflow.yaml`?

Analect opened this issue

@terrytangyuan .. I've been working my way through your book. Thanks for putting together such an informative resource. The latest version available on Manning is v7, from May. Perhaps there's something more up to date that might explain why I'm hitting the issue described below?

[screenshot]

In Chapter 8, various CRDs are created with `kubectl kustomize manifests | k apply -f -`. It might be worth using the long form, `kubectl kustomize manifests | kubectl apply -f -`, for those who may not have an alias `k` set up for `kubectl`.

I notice that this invokes the `distributed-ml-patterns/code/project/manifests/kustomization.yaml` file, which in turn pulls in various manifests from the `argo-workflows` and `kubeflow-training` folders.

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: kubeflow

resources:
- argo-workflows/
- kubeflow-training/
```

When I try to run `kubectl create -f workflow.yaml` in Chapter 9, it fails (see below). I think this might be due to the correct workflow templates not being pre-populated in Argo. Could it be that the manifests from the `e2e-demo` folder should have been included in the `kustomization.yaml` above, or is something else missing?
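For concreteness, I had half-expected the file to look something like the following (just a guess on my part, with `e2e-demo/` as the extra entry):

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: kubeflow

resources:
- argo-workflows/
- kubeflow-training/
- e2e-demo/   # hypothetical addition; not in the current kustomization.yaml
```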

Appreciate your input. Thanks.

```
~ % kubectl get workflows -n kubeflow
NAME             STATUS   AGE   MESSAGE
tfjob-wf-lzwv9   Failed   60m   invalid spec: templates.tfjob-wf.steps[0].data-ingestion-step template name 'data-ingestion-step' undefined
~ % kubectl describe workflows tfjob-wf-lzwv9 -n kubeflow
Name:         tfjob-wf-lzwv9
Namespace:    kubeflow
Labels:       workflows.argoproj.io/completed=true
              workflows.argoproj.io/phase=Failed
Annotations:  <none>
API Version:  argoproj.io/v1alpha1
Kind:         Workflow
Metadata:
  Creation Timestamp:  2023-08-21T14:56:06Z
  Generate Name:       tfjob-wf-
  Generation:          2
  Resource Version:    8141
  UID:                 3a573b29-ef27-4940-9c99-5f2c541850ea
Spec:
  Arguments:
  Entrypoint:  tfjob-wf
  Pod GC:
    Strategy:  OnPodSuccess
  Templates:
    Inputs:
    Metadata:
    Name:  tfjob-wf
    Outputs:
    Steps:
      [map[arguments:map[] name:data-ingestion-step template:data-ingestion-step]]
      [map[arguments:map[] name:distributed-tf-training-steps template:distributed-tf-training-steps]]
      [map[arguments:map[] name:model-selection-step template:model-selection-step]]
      [map[arguments:map[] name:create-model-serving-service template:create-model-serving-service]]
  Volumes:
    Name:  model
    Persistent Volume Claim:
      Claim Name:  strategy-volume
    Name:          data-ingestion-step
    Name:          distributed-tf-training-steps
    Name:          cnn-model
    Name:          cnn-model-with-dropout
    Name:          cnn-model-with-batch-norm
    Name:          model-selection-step
    Name:          create-model-serving-service
Status:
  Conditions:
    Status:     True
    Type:       Completed
  Finished At:  2023-08-21T14:56:06Z
  Message:      invalid spec: templates.tfjob-wf.steps[0].data-ingestion-step template name 'data-ingestion-step' undefined
  Phase:        Failed
  Progress:     0/0
  Started At:   2023-08-21T14:56:06Z
Events:
  Type     Reason          Age   From                 Message
  ----     ------          ----  ----                 -------
  Warning  WorkflowFailed  60m   workflow-controller  invalid spec: templates.tfjob-wf.steps[0].data-ingestion-step template name 'data-ingestion-step' undefined
```

> It might be worth using the long form, `kubectl kustomize manifests | kubectl apply -f -`, for those who may not have an alias `k` set up for `kubectl`.

Thanks! That is fixed in the book, and I just fixed it in the README file in the repo as well.

> `invalid spec: templates.tfjob-wf.steps[0].data-ingestion-step template name 'data-ingestion-step' undefined`

It seems like the template does not exist. Could you run `kubectl apply -f` on this file: https://github.com/terrytangyuan/distributed-ml-patterns/blob/main/code/project/manifests/e2e-demo/workflows-templates-tfjob.yaml?

Actually, can you try the latest version on the main branch? That data ingestion template should already exist in https://github.com/terrytangyuan/distributed-ml-patterns/blob/main/code/project/code/workflow.yaml#L26
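For context, a step's `template:` field has to resolve to a template defined in the same `spec.templates` list. An abridged sketch (the container details here are invented for illustration):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: tfjob-wf-
spec:
  entrypoint: tfjob-wf
  templates:
  - name: tfjob-wf
    steps:
    - - name: data-ingestion-step
        template: data-ingestion-step   # resolved against spec.templates below
  - name: data-ingestion-step           # if this entry is missing, validation fails with
    container:                          # "template name 'data-ingestion-step' undefined"
      image: python:3.9
      command: [python, -c, "print('ingesting data')"]
```

So if you see that error, the Workflow you submitted does not contain (or reference) a template with that name.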

@terrytangyuan .. thanks for the prompt response. I have been using the latest code as of today from this repo.

Yes, those e2e_demo steps didn't appear anywhere in the book ... and I had overlooked those commands in that README in the project folder. Let me try to get it working.

I also hadn't focused on the README at distributed-ml-patterns/tree/main/code/project/code, which contains some material that hasn't made it into the book version (v7) I had been following. Maybe that has since been addressed.

Chapter 9 is full of great content, but I'm sometimes struggling to get the code snippets from the book working as you explain how each function works ... it seems they need to be run as the full Python script (all functions together), like https://github.com/terrytangyuan/distributed-ml-patterns/blob/main/code/project/code/multi-worker-distributed-training.py.

Getting things working in Chapter 9 requires you to have run certain steps in Chapter 8 of the book. If it's not too late, it might be worth recapping those required steps at the beginning of Chapter 9, since some readers might wish to plunge directly into the end-to-end example in Chapter 9.

> Getting things working in Chapter 9 requires you to have run certain steps in Chapter 8 of the book. If it's not too late, it might be worth recapping those required steps at the beginning of Chapter 9, since some readers might wish to plunge directly into the end-to-end example in Chapter 9.

That's great feedback. Thank you! If you have specific recommendations on what prerequisites are missing to follow the code snippets in the last chapter, please let me know.

> Yes, those e2e_demo steps didn't appear anywhere in the book ... and I had overlooked those commands in that README in the project folder. Let me try to get it working.

They should not be part of the book. Please follow the workflow.yaml in the repo for now.

So I ran these, per the README, and they appeared to run OK (see middle job in screenshot below).

```
kubectl create -f manifests/e2e-demo/workflows-templates-tfjob.yaml
kubectl create -f manifests/e2e-demo/e2e-workflow.yaml
```

[screenshot]

However, each time I try to run `kubectl create -f workflow.yaml` using this workflow.yaml file, it fails, per the error at the top of this thread.

It seems I am missing the various templates referenced here. It somehow expects these workflow templates to pre-exist, but I can't find them anywhere in the code. Am I missing something?
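For what it's worth, my (possibly mistaken) understanding is that templates only need to pre-exist when a step pulls them in via a `templateRef`, e.g. (the resource name below is hypothetical, presumably whatever workflows-templates-tfjob.yaml creates):

```yaml
templates:
- name: tfjob-wf
  steps:
  - - name: data-ingestion-step
      templateRef:                        # references a pre-created WorkflowTemplate
        name: workflows-templates-tfjob   # hypothetical WorkflowTemplate resource name
        template: data-ingestion-step
```

whereas a plain `template:` reference should resolve within the submitted Workflow itself.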

Thank you! I just fixed it. Could you try again?

That change permitted the workflow to begin running. It seemed like a small tweak. Do you mind explaining what the fix was?

[screenshot]

However, I'm seeing that the `multi-worker-training-*` pods get stuck in Pending.

[screenshot]

If I look into one of them, I see the following event:

```
Warning  FailedScheduling  5m31s  default-scheduler  0/1 nodes are available: 1 persistentvolumeclaim "strategy-volume" not found. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
```

Maybe this is a namespace issue. My persistent volume claim `strategy-volume` does exist ... but in the default namespace. I think it was generated from here. Perhaps it needs to be in the kubeflow namespace for this to work?

[screenshot]
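If the namespace is the problem, I assume recreating the claim with an explicit namespace would look roughly like this (the size and access mode are my guesses from the original claim):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: strategy-volume
  namespace: kubeflow   # a pod can only mount claims from its own namespace
spec:
  accessModes:
  - ReadWriteOnce       # assumed
  resources:
    requests:
      storage: 1Gi      # assumed
```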

Did you run the following to change the current namespace?

```
kns kubeflow
```

Once it's run, all your manifests that don't specify a namespace will use the current namespace.

Is `kns` an alias for something else?

I see it at blendle/kns.

I suppose the kubectl-native way would be `kubectl config set-context --current --namespace=kubeflow`. I don't often do that.
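(If I understand correctly, both `kns` and `set-context` just record the namespace on the active kubeconfig context, roughly like this; the cluster/user fields below are placeholders:)

```yaml
# Excerpt of ~/.kube/config after setting the namespace
contexts:
- name: k3d-distml
  context:
    cluster: k3d-distml
    user: admin@k3d-distml   # placeholder
    namespace: kubeflow      # manifests without metadata.namespace land here
```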

Ran this:

```
% kubectl config set-context --current --namespace=kubeflow
Context "k3d-distml" modified.
% kubectl create -f workflow.yaml
workflow.argoproj.io/tfjob-wf-k45fh created
```

Same issue with `Warning FailedScheduling 5m31s default-scheduler 0/1 nodes are available: 1 persistentvolumeclaim "strategy-volume" not found. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.`

I'll have to try to re-create `strategy-volume` in the kubeflow namespace tomorrow ... and see if that fixes it.

It should be covered in the previous chapter. Instead of switching the current namespace, you can also add `-n kubeflow` to your kubectl commands to specify the namespace explicitly.

@terrytangyuan ... got things working by recreating the `strategy-volume` claim in the kubeflow namespace.

I noticed here that it should be `"--model_type", "batch_norm"` for the third model, rather than `dropout`, which is currently repeated for models 2 and 3.
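i.e., something like this for model 3's container args (an abridged sketch):

```yaml
args:             # abridged; the third model's container args
- "--model_type"
- "batch_norm"    # was "dropout", repeated from model 2
```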

I'll close this out for now and maybe raise a separate issue with other feedback on Chapter 9.

Great, thanks!