kubeflow / katib

Automated Machine Learning on Kubernetes

Home Page: https://www.kubeflow.org/docs/components/katib

PyTorchJob not detected by the training operator when launched via Katib

bharathappali opened this issue · comments

What steps did you take and what happened:
I'm a new Katib user. I installed Katib and the training operator as standalone installations on my local minikube cluster, and I tried to start an HPO run to tune the hyperparameters of a PyTorchJob (the distributed training example). The experiment does not start because the training operator is unable to detect the kind PyTorchJob. Am I missing any configuration? I was using the default PyTorch example shown in the Katib UI.

Training operator log:

[abharath@abharath-thinkpadt14sgen2i ~]$ kubectl -n kubeflow logs -f training-operator-984cfd546-fqpdk
2024-01-30T08:00:45Z    INFO    controller-runtime.metrics    Metrics server is starting to listen    {"addr": ":8080"}
2024-01-30T08:00:45Z    INFO    setup    starting manager
2024-01-30T08:00:45Z    INFO    setup    registering controllers...
2024-01-30T08:00:45Z    INFO    Starting server    {"kind": "health probe", "addr": "[::]:8081"}
2024-01-30T08:00:45Z    INFO    starting server    {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "mxjob-controller", "source": "kind source: *v1.MXJob"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "mxjob-controller", "source": "kind source: *v1.Pod"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "mxjob-controller", "source": "kind source: *v1.Service"}
2024-01-30T08:00:45Z    INFO    Starting Controller    {"controller": "mxjob-controller"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "xgboostjob-controller", "source": "kind source: *v1.XGBoostJob"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "xgboostjob-controller", "source": "kind source: *v1.Pod"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "xgboostjob-controller", "source": "kind source: *v1.Service"}
2024-01-30T08:00:45Z    INFO    Starting Controller    {"controller": "xgboostjob-controller"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.MPIJob"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.Pod"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.ConfigMap"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.Role"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.RoleBinding"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.ServiceAccount"}
2024-01-30T08:00:45Z    INFO    Starting Controller    {"controller": "mpijob-controller"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "paddlejob-controller", "source": "kind source: *v1.PaddleJob"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "paddlejob-controller", "source": "kind source: *v1.Pod"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "paddlejob-controller", "source": "kind source: *v1.Service"}
2024-01-30T08:00:45Z    INFO    Starting Controller    {"controller": "paddlejob-controller"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "tfjob-controller", "source": "kind source: *v1.TFJob"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "tfjob-controller", "source": "kind source: *v1.Pod"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "tfjob-controller", "source": "kind source: *v1.Service"}
2024-01-30T08:00:45Z    INFO    Starting Controller    {"controller": "tfjob-controller"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "pytorchjob-controller", "source": "kind source: *v1.PyTorchJob"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "pytorchjob-controller", "source": "kind source: *v1.Pod"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "pytorchjob-controller", "source": "kind source: *v1.Service"}
2024-01-30T08:00:45Z    INFO    Starting Controller    {"controller": "pytorchjob-controller"}
2024-01-30T08:00:45Z    INFO    Starting workers    {"controller": "paddlejob-controller", "worker count": 1}
2024-01-30T08:00:45Z    INFO    Starting workers    {"controller": "xgboostjob-controller", "worker count": 1}
2024-01-30T08:00:45Z    INFO    Starting workers    {"controller": "mpijob-controller", "worker count": 1}
2024-01-30T08:00:45Z    INFO    Starting workers    {"controller": "tfjob-controller", "worker count": 1}
2024-01-30T08:00:45Z    INFO    Starting workers    {"controller": "mxjob-controller", "worker count": 1}
2024-01-30T08:00:45Z    INFO    Starting workers    {"controller": "pytorchjob-controller", "worker count": 1}

Katib controller log:

{"level":"error","ts":"2024-01-30T12:10:32Z","logger":"trial-controller","msg":"Reconcile job error","Trial":{"name":"random-experiment-scpkhp2b","namespace":"kubeflow-user"},"error":"no matches for kind \"PyTorchJob\" in version \"kubeflow.org/v1\"","stacktrace":"github.com/kubeflow/katib/pkg/controller.v1beta1/trial.(*ReconcileTrial).reconcileTrial\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/trial/trial_controller.go:221\ngithub.com/kubeflow/katib/pkg/controller.v1beta1/trial.(*ReconcileTrial).Reconcile\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/trial/trial_controller.go:180\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:226"}

CRD list

[abharath@abharath-thinkpadt14sgen2i ~]$ kubectl get crds
NAME                       CREATED AT
experiments.kubeflow.org   2024-01-30T07:19:28Z
mpijobs.kubeflow.org       2024-01-30T08:00:30Z
mxjobs.kubeflow.org        2024-01-30T08:00:31Z
paddlejobs.kubeflow.org    2024-01-30T08:00:31Z
pytorchjobs.kubeflow.org   2024-01-30T08:00:31Z
suggestions.kubeflow.org   2024-01-30T07:19:28Z
tfjobs.kubeflow.org        2024-01-30T08:00:31Z
trials.kubeflow.org        2024-01-30T07:19:28Z
xgboostjobs.kubeflow.org   2024-01-30T08:00:31Z

Environment:

  • Katib version (check the Katib controller image version): 0.16.0
  • Kubernetes version: (kubectl version):
[abharath@abharath-thinkpadt14sgen2i gpuXplore]$ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.10", GitCommit:"0fa26aea1d5c21516b0d96fea95a77d8d429912e", GitTreeState:"archive", BuildDate:"2024-01-18T00:00:00Z", GoVersion:"go1.21.6", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.3", GitCommit:"aef86a93758dc3cb2c658dd9657ab4ad4afc21cb", GitTreeState:"clean", BuildDate:"2022-07-13T14:23:26Z", GoVersion:"go1.18.3", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.27) and server (1.24) exceeds the supported minor version skew of +/-1
  • OS (uname -a):
[abharath@abharath-thinkpadt14sgen2i gpuXplore]$ uname -a
Linux abharath-thinkpadt14sgen2i.remote.csb 6.7.3-200.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Feb  1 03:29:52 UTC 2024 x86_64 GNU/Linux

Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍

@bharathappali Did you deploy the training-operator first?
Also, could you share your Experiment YAML with us?

Thanks for the response @tenzen-y

Did you deploy the training-operator first?

No, I deployed Katib first and the training operator later. Should I install the training operator first?

could you share your Experiment YAML with us?

Yes, I created the experiment again; here is the YAML from the Katib UI:

metadata:
  name: random-experiment
  namespace: default
  uid: 86f7d5ca-b125-47c1-971b-0e0bc3906e41
  resourceVersion: '11624'
  generation: 1
  creationTimestamp: '2024-02-27T10:37:09Z'
  finalizers:
    - update-prometheus-metrics
  managedFields:
    - manager: Go-http-client
      operation: Update
      apiVersion: kubeflow.org/v1beta1
      time: '2024-02-27T10:37:09Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:finalizers:
            .: {}
            v:"update-prometheus-metrics": {}
        f:spec:
          .: {}
          f:algorithm:
            .: {}
            f:algorithmName: {}
            f:algorithmSettings: {}
          f:maxFailedTrialCount: {}
          f:maxTrialCount: {}
          f:metricsCollectorSpec:
            .: {}
            f:collector:
              .: {}
              f:kind: {}
          f:objective:
            .: {}
            f:additionalMetricNames: {}
            f:goal: {}
            f:objectiveMetricName: {}
            f:type: {}
          f:parallelTrialCount: {}
          f:parameters: {}
          f:resumePolicy: {}
          f:trialTemplate:
            .: {}
            f:failureCondition: {}
            f:primaryContainerName: {}
            f:successCondition: {}
            f:trialParameters: {}
            f:trialSpec:
              .: {}
              f:apiVersion: {}
              f:kind: {}
              f:spec:
                .: {}
                f:pytorchReplicaSpecs:
                  .: {}
                  f:Master:
                    .: {}
                    f:replicas: {}
                    f:restartPolicy: {}
                    f:template:
                      .: {}
                      f:spec:
                        .: {}
                        f:containers: {}
                  f:Worker:
                    .: {}
                    f:replicas: {}
                    f:restartPolicy: {}
                    f:template:
                      .: {}
                      f:spec:
                        .: {}
                        f:containers: {}
    - manager: Go-http-client
      operation: Update
      apiVersion: kubeflow.org/v1beta1
      time: '2024-02-27T10:38:00Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          .: {}
          f:conditions: {}
          f:currentOptimalTrial:
            .: {}
            f:observation: {}
          f:pendingTrialList: {}
          f:startTime: {}
          f:trials: {}
          f:trialsPending: {}
      subresource: status
spec:
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        max: '0.03'
        min: '0.01'
        step: '0.01'
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: Validation-accuracy
    additionalMetricNames:
      - Train-accuracy
    metricStrategies:
      - name: Validation-accuracy
        value: max
      - name: Train-accuracy
        value: max
  algorithm:
    algorithmName: bayesianoptimization
    algorithmSettings:
      - name: base_estimator
        value: GP
      - name: n_initial_points
        value: '10'
      - name: acq_func
        value: gp_hedge
      - name: acq_optimizer
        value: auto
  trialTemplate:
    trialSpec:
      apiVersion: kubeflow.org/v1
      kind: PyTorchJob
      spec:
        pytorchReplicaSpecs:
          Master:
            replicas: 1
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                  - command:
                      - python3
                      - /opt/pytorch-mnist/mnist.py
                      - '--epochs=1'
                      - '--lr=${trialParameters.learningRate}'
                      - '--momentum=0.5'
                    image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v0.16.0
                    name: pytorch
          Worker:
            replicas: 2
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                  - command:
                      - python3
                      - /opt/pytorch-mnist/mnist.py
                      - '--epochs=1'
                      - '--lr=${trialParameters.learningRate}'
                      - '--momentum=0.5'
                    image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v0.16.0
                    name: pytorch
    trialParameters:
      - name: learningRate
        reference: lr
    primaryPodLabels:
      training.kubeflow.org/job-role: master
    primaryContainerName: training-container
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  metricsCollectorSpec:
    collector:
      kind: StdOut
  resumePolicy: Never
status:
  startTime: '2024-02-27T10:37:09Z'
  conditions:
    - type: Created
      status: 'True'
      reason: ExperimentCreated
      message: Experiment is created
      lastUpdateTime: '2024-02-27T10:37:09Z'
      lastTransitionTime: '2024-02-27T10:37:09Z'
    - type: Running
      status: 'True'
      reason: ExperimentRunning
      message: Experiment is running
      lastUpdateTime: '2024-02-27T10:38:00Z'
      lastTransitionTime: '2024-02-27T10:38:00Z'
  currentOptimalTrial:
    observation: {}
  pendingTrialList:
    - random-experiment-65rvn6c7
    - random-experiment-lhdx4rjr
    - random-experiment-8gtk2z74
  trials: 3
  trialsPending: 3

No I have deployed katib and later I deployed the training operator, should I install training operator first?

Oh, I see. The training-operator must be deployed first. If Katib was deployed first, you need to restart the Katib controller Pod after deploying the training-operator. Could you confirm this on your local cluster?
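A minimal sketch of that restart, assuming the standard standalone install (a Deployment named `katib-controller` in the `kubeflow` namespace; adjust the names if your install differs):

```shell
# Restart the Katib controller so it re-discovers the PyTorchJob CRD
# that the training-operator registered after Katib started
kubectl rollout restart deployment/katib-controller -n kubeflow

# Wait until the restarted Pod is ready
kubectl rollout status deployment/katib-controller -n kubeflow
```

The restart is needed because the controller builds its knowledge of available kinds at startup; a CRD installed afterwards is not picked up until the Pod restarts.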

Thanks. I restarted Katib and tried to run the PyTorch job (distributed training example), but when I create an experiment with this YAML:

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-experiment
  namespace: ''
spec:
  maxTrialCount: 12
  parallelTrialCount: 3
  maxFailedTrialCount: 3
  resumePolicy: Never
  objective:
    type: maximize
    goal: 0.9
    objectiveMetricName: accuracy
    additionalMetricNames: []
  algorithm:
    algorithmName: bayesianoptimization
    algorithmSettings:
      - name: base_estimator
        value: GP
      - name: n_initial_points
        value: '10'
      - name: acq_func
        value: gp_hedge
      - name: acq_optimizer
        value: auto
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: '0.01'
        max: '0.03'
        step: '0.01'
  metricsCollectorSpec:
    collector:
      kind: StdOut
  trialTemplate:
    primaryContainerName: training-container
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
    retain: false
    trialParameters:
      - name: learningRate
        reference: lr
        description: ''
    trialSpec:
      apiVersion: kubeflow.org/v1
      kind: PyTorchJob
      spec:
        pytorchReplicaSpecs:
          Master:
            replicas: 1
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                  - name: pytorch
                    image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v0.16.0
                    command:
                      - python3
                      - /opt/pytorch-mnist/mnist.py
                      - '--epochs=1'
                      - '--lr=${trialParameters.learningRate}'
                      - '--momentum=0.5'
          Worker:
            replicas: 2
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                  - name: pytorch
                    image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v0.16.0
                    command:
                      - python3
                      - /opt/pytorch-mnist/mnist.py
                      - '--epochs=1'
                      - '--lr=${trialParameters.learningRate}'
                      - '--momentum=0.5'

This is the UI-generated YAML. Katib was unable to find a container named training-container, so I changed the container names to training-container and got this error:

2024-02-27T11:10:40Z	ERROR	PyTorchJob failed validation	{"pytorchjob": {"name":"random-experiment-bjl5fqvl","namespace":"default"}, "error": "PyTorchJobSpec is not valid: There is no container named pytorch in Master"}
2024-02-27T11:10:40Z	ERROR	Reconciler error	{"controller": "pytorchjob-controller", "object": {"name":"random-experiment-bjl5fqvl","namespace":"default"}, "namespace": "default", "name": "random-experiment-bjl5fqvl", "reconcileID": "2fb2991b-59fb-4d83-85ba-30c1765e7978", "error": "PyTorchJobSpec is not valid: There is no container named pytorch in Master"}

Later I changed the primary container name to pytorch, as required by the PyTorchJobSpec, and I can see the pods getting created:

[abharath@abharath-thinkpadt14sgen2i ~]$ kubectl get pods 
NAME                                                      READY   STATUS              RESTARTS      AGE
random-experiment-bayesianoptimization-5749c87757-nzg2l   1/1     Running             0             4m24s
random-experiment-kf8zd5hs-master-0                       0/2     ContainerCreating   0             4m2s
random-experiment-kf8zd5hs-worker-0                       0/1     Init:0/1            1 (37s ago)   4m2s
random-experiment-kf8zd5hs-worker-1                       0/1     Init:0/1            1 (37s ago)   4m2s
random-experiment-nfhzbkq4-master-0                       0/2     ContainerCreating   0             4m3s
random-experiment-nfhzbkq4-worker-0                       0/1     Init:0/1            1 (37s ago)   4m2s
random-experiment-nfhzbkq4-worker-1                       0/1     Init:0/1            1 (37s ago)   4m2s
random-experiment-pvwccdnr-master-0                       0/2     ContainerCreating   0             4m3s
random-experiment-pvwccdnr-worker-0                       0/1     Init:0/1            1 (41s ago)   4m3s
random-experiment-pvwccdnr-worker-1                       0/1     Init:0/1            1 (41s ago)   4m3s

Thanks @tenzen-y

Thank you for creating this @bharathappali.
That's correct, you have to name the container pytorch for PyTorchJob.
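The fix amounts to keeping the two names consistent: the PyTorchJob container must be named pytorch, and Katib's primaryContainerName must point at that same name. A minimal sketch of the relevant trialTemplate fields, trimmed from the Experiment YAML above:

```yaml
# Both names must agree: the container inside the PyTorchJob is "pytorch",
# and primaryContainerName tells Katib which container to treat as primary.
trialTemplate:
  primaryContainerName: pytorch
  trialSpec:
    apiVersion: kubeflow.org/v1
    kind: PyTorchJob
    spec:
      pytorchReplicaSpecs:
        Master:
          replicas: 1
          template:
            spec:
              containers:
                - name: pytorch   # required name for PyTorchJob containers
                  image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v0.16.0
```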

Did your Katib Trials succeed after you renamed the container name and primaryContainerName?

Yes @andreyvelich, I was able to run Katib trials after the changes.

Thank you! Closing this issue.