kubeflow / katib

Automated Machine Learning on Kubernetes

Home Page: https://www.kubeflow.org/docs/components/katib

PyTorchJob not detected by the training operator when launched via Katib

bharathappali opened this issue · comments

What steps did you take and what happened:
I'm a new Katib user. I installed Katib and the training operator as standalone installations on my local minikube cluster, and I tried to start an HPO run to tune the hyperparameters of a PyTorchJob (the distributed training example). The experiment does not start because the training operator is unable to detect the kind PyTorchJob. Am I missing any configuration? I was using the default PyTorch example shown in the Katib UI.

Training operator log:

[abharath@abharath-thinkpadt14sgen2i ~]$ kubectl -n kubeflow logs -f training-operator-984cfd546-fqpdk
2024-01-30T08:00:45Z    INFO    controller-runtime.metrics    Metrics server is starting to listen    {"addr": ":8080"}
2024-01-30T08:00:45Z    INFO    setup    starting manager
2024-01-30T08:00:45Z    INFO    setup    registering controllers...
2024-01-30T08:00:45Z    INFO    Starting server    {"kind": "health probe", "addr": "[::]:8081"}
2024-01-30T08:00:45Z    INFO    starting server    {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "mxjob-controller", "source": "kind source: *v1.MXJob"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "mxjob-controller", "source": "kind source: *v1.Pod"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "mxjob-controller", "source": "kind source: *v1.Service"}
2024-01-30T08:00:45Z    INFO    Starting Controller    {"controller": "mxjob-controller"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "xgboostjob-controller", "source": "kind source: *v1.XGBoostJob"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "xgboostjob-controller", "source": "kind source: *v1.Pod"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "xgboostjob-controller", "source": "kind source: *v1.Service"}
2024-01-30T08:00:45Z    INFO    Starting Controller    {"controller": "xgboostjob-controller"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.MPIJob"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.Pod"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.ConfigMap"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.Role"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.RoleBinding"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.ServiceAccount"}
2024-01-30T08:00:45Z    INFO    Starting Controller    {"controller": "mpijob-controller"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "paddlejob-controller", "source": "kind source: *v1.PaddleJob"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "paddlejob-controller", "source": "kind source: *v1.Pod"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "paddlejob-controller", "source": "kind source: *v1.Service"}
2024-01-30T08:00:45Z    INFO    Starting Controller    {"controller": "paddlejob-controller"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "tfjob-controller", "source": "kind source: *v1.TFJob"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "tfjob-controller", "source": "kind source: *v1.Pod"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "tfjob-controller", "source": "kind source: *v1.Service"}
2024-01-30T08:00:45Z    INFO    Starting Controller    {"controller": "tfjob-controller"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "pytorchjob-controller", "source": "kind source: *v1.PyTorchJob"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "pytorchjob-controller", "source": "kind source: *v1.Pod"}
2024-01-30T08:00:45Z    INFO    Starting EventSource    {"controller": "pytorchjob-controller", "source": "kind source: *v1.Service"}
2024-01-30T08:00:45Z    INFO    Starting Controller    {"controller": "pytorchjob-controller"}
2024-01-30T08:00:45Z    INFO    Starting workers    {"controller": "paddlejob-controller", "worker count": 1}
2024-01-30T08:00:45Z    INFO    Starting workers    {"controller": "xgboostjob-controller", "worker count": 1}
2024-01-30T08:00:45Z    INFO    Starting workers    {"controller": "mpijob-controller", "worker count": 1}
2024-01-30T08:00:45Z    INFO    Starting workers    {"controller": "tfjob-controller", "worker count": 1}
2024-01-30T08:00:45Z    INFO    Starting workers    {"controller": "mxjob-controller", "worker count": 1}
2024-01-30T08:00:45Z    INFO    Starting workers    {"controller": "pytorchjob-controller", "worker count": 1}

Katib controller log:

{"level":"error","ts":"2024-01-30T12:10:32Z","logger":"trial-controller","msg":"Reconcile job error","Trial":{"name":"random-experiment-scpkhp2b","namespace":"kubeflow-user"},"error":"no matches for kind \"PyTorchJob\" in version \"kubeflow.org/v1\"","stacktrace":"github.com/kubeflow/katib/pkg/controller.v1beta1/trial.(*ReconcileTrial).reconcileTrial\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/trial/trial_controller.go:221\ngithub.com/kubeflow/katib/pkg/controller.v1beta1/trial.(*ReconcileTrial).Reconcile\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/trial/trial_controller.go:180\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:226"}

CRD list

[abharath@abharath-thinkpadt14sgen2i ~]$ kubectl get crds
NAME                       CREATED AT
experiments.kubeflow.org   2024-01-30T07:19:28Z
mpijobs.kubeflow.org       2024-01-30T08:00:30Z
mxjobs.kubeflow.org        2024-01-30T08:00:31Z
paddlejobs.kubeflow.org    2024-01-30T08:00:31Z
pytorchjobs.kubeflow.org   2024-01-30T08:00:31Z
suggestions.kubeflow.org   2024-01-30T07:19:28Z
tfjobs.kubeflow.org        2024-01-30T08:00:31Z
trials.kubeflow.org        2024-01-30T07:19:28Z
xgboostjobs.kubeflow.org   2024-01-30T08:00:31Z

Environment:

  • Katib version (check the Katib controller image version): 0.16.0
  • Kubernetes version: (kubectl version):
[abharath@abharath-thinkpadt14sgen2i gpuXplore]$ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.10", GitCommit:"0fa26aea1d5c21516b0d96fea95a77d8d429912e", GitTreeState:"archive", BuildDate:"2024-01-18T00:00:00Z", GoVersion:"go1.21.6", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.3", GitCommit:"aef86a93758dc3cb2c658dd9657ab4ad4afc21cb", GitTreeState:"clean", BuildDate:"2022-07-13T14:23:26Z", GoVersion:"go1.18.3", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.27) and server (1.24) exceeds the supported minor version skew of +/-1
  • OS (uname -a):
[abharath@abharath-thinkpadt14sgen2i gpuXplore]$ uname -a
Linux abharath-thinkpadt14sgen2i.remote.csb 6.7.3-200.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Feb  1 03:29:52 UTC 2024 x86_64 GNU/Linux

Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍

@bharathappali Did you deploy the training-operator first?
Also, could you share your Experiment YAML with us?

Thanks for the response @tenzen-y

Did you deploy the training-operator first?

No, I deployed Katib first and the training operator later. Should I install the training operator first?

could you share your Experiment YAML with us?

Yes, I created the experiment again; here is the YAML from the Katib UI:

metadata:
  name: random-experiment
  namespace: default
  uid: 86f7d5ca-b125-47c1-971b-0e0bc3906e41
  resourceVersion: '11624'
  generation: 1
  creationTimestamp: '2024-02-27T10:37:09Z'
  finalizers:
    - update-prometheus-metrics
  managedFields:
    - manager: Go-http-client
      operation: Update
      apiVersion: kubeflow.org/v1beta1
      time: '2024-02-27T10:37:09Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:finalizers:
            .: {}
            v:"update-prometheus-metrics": {}
        f:spec:
          .: {}
          f:algorithm:
            .: {}
            f:algorithmName: {}
            f:algorithmSettings: {}
          f:maxFailedTrialCount: {}
          f:maxTrialCount: {}
          f:metricsCollectorSpec:
            .: {}
            f:collector:
              .: {}
              f:kind: {}
          f:objective:
            .: {}
            f:additionalMetricNames: {}
            f:goal: {}
            f:objectiveMetricName: {}
            f:type: {}
          f:parallelTrialCount: {}
          f:parameters: {}
          f:resumePolicy: {}
          f:trialTemplate:
            .: {}
            f:failureCondition: {}
            f:primaryContainerName: {}
            f:successCondition: {}
            f:trialParameters: {}
            f:trialSpec:
              .: {}
              f:apiVersion: {}
              f:kind: {}
              f:spec:
                .: {}
                f:pytorchReplicaSpecs:
                  .: {}
                  f:Master:
                    .: {}
                    f:replicas: {}
                    f:restartPolicy: {}
                    f:template:
                      .: {}
                      f:spec:
                        .: {}
                        f:containers: {}
                  f:Worker:
                    .: {}
                    f:replicas: {}
                    f:restartPolicy: {}
                    f:template:
                      .: {}
                      f:spec:
                        .: {}
                        f:containers: {}
    - manager: Go-http-client
      operation: Update
      apiVersion: kubeflow.org/v1beta1
      time: '2024-02-27T10:38:00Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          .: {}
          f:conditions: {}
          f:currentOptimalTrial:
            .: {}
            f:observation: {}
          f:pendingTrialList: {}
          f:startTime: {}
          f:trials: {}
          f:trialsPending: {}
      subresource: status
spec:
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        max: '0.03'
        min: '0.01'
        step: '0.01'
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: Validation-accuracy
    additionalMetricNames:
      - Train-accuracy
    metricStrategies:
      - name: Validation-accuracy
        value: max
      - name: Train-accuracy
        value: max
  algorithm:
    algorithmName: bayesianoptimization
    algorithmSettings:
      - name: base_estimator
        value: GP
      - name: n_initial_points
        value: '10'
      - name: acq_func
        value: gp_hedge
      - name: acq_optimizer
        value: auto
  trialTemplate:
    trialSpec:
      apiVersion: kubeflow.org/v1
      kind: PyTorchJob
      spec:
        pytorchReplicaSpecs:
          Master:
            replicas: 1
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                  - command:
                      - python3
                      - /opt/pytorch-mnist/mnist.py
                      - '--epochs=1'
                      - '--lr=${trialParameters.learningRate}'
                      - '--momentum=0.5'
                    image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v0.16.0
                    name: pytorch
          Worker:
            replicas: 2
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                  - command:
                      - python3
                      - /opt/pytorch-mnist/mnist.py
                      - '--epochs=1'
                      - '--lr=${trialParameters.learningRate}'
                      - '--momentum=0.5'
                    image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v0.16.0
                    name: pytorch
    trialParameters:
      - name: learningRate
        reference: lr
    primaryPodLabels:
      training.kubeflow.org/job-role: master
    primaryContainerName: training-container
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  metricsCollectorSpec:
    collector:
      kind: StdOut
  resumePolicy: Never
status:
  startTime: '2024-02-27T10:37:09Z'
  conditions:
    - type: Created
      status: 'True'
      reason: ExperimentCreated
      message: Experiment is created
      lastUpdateTime: '2024-02-27T10:37:09Z'
      lastTransitionTime: '2024-02-27T10:37:09Z'
    - type: Running
      status: 'True'
      reason: ExperimentRunning
      message: Experiment is running
      lastUpdateTime: '2024-02-27T10:38:00Z'
      lastTransitionTime: '2024-02-27T10:38:00Z'
  currentOptimalTrial:
    observation: {}
  pendingTrialList:
    - random-experiment-65rvn6c7
    - random-experiment-lhdx4rjr
    - random-experiment-8gtk2z74
  trials: 3
  trialsPending: 3

No I have deployed katib and later I deployed the training operator, should I install training operator first?

Oh, I see. The training-operator must be deployed first. If Katib was deployed first, you need to restart the Katib controller Pod after deploying the training-operator. Could you confirm this on your local cluster?
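A minimal sketch of that restart, assuming the standard standalone install (a Deployment named `katib-controller` in the `kubeflow` namespace; adjust the names if your install differs):

```shell
# Restart the Katib controller so it re-discovers the PyTorchJob CRD
# that the training-operator registered after Katib started
kubectl rollout restart deployment/katib-controller -n kubeflow

# Wait until the restarted Pod is ready
kubectl rollout status deployment/katib-controller -n kubeflow
```

The restart is needed because the controller builds its knowledge of available kinds at startup; a CRD installed afterwards is not picked up until the Pod restarts.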

Thanks. I restarted Katib and tried to run the PyTorch job (distributed training example), but when I create an experiment with this YAML:

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-experiment
  namespace: ''
spec:
  maxTrialCount: 12
  parallelTrialCount: 3
  maxFailedTrialCount: 3
  resumePolicy: Never
  objective:
    type: maximize
    goal: 0.9
    objectiveMetricName: accuracy
    additionalMetricNames: []
  algorithm:
    algorithmName: bayesianoptimization
    algorithmSettings:
      - name: base_estimator
        value: GP
      - name: n_initial_points
        value: '10'
      - name: acq_func
        value: gp_hedge
      - name: acq_optimizer
        value: auto
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: '0.01'
        max: '0.03'
        step: '0.01'
  metricsCollectorSpec:
    collector:
      kind: StdOut
  trialTemplate:
    primaryContainerName: training-container
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
    retain: false
    trialParameters:
      - name: learningRate
        reference: lr
        description: ''
    trialSpec:
      apiVersion: kubeflow.org/v1
      kind: PyTorchJob
      spec:
        pytorchReplicaSpecs:
          Master:
            replicas: 1
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                  - name: pytorch
                    image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v0.16.0
                    command:
                      - python3
                      - /opt/pytorch-mnist/mnist.py
                      - '--epochs=1'
                      - '--lr=${trialParameters.learningRate}'
                      - '--momentum=0.5'
          Worker:
            replicas: 2
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                  - name: pytorch
                    image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v0.16.0
                    command:
                      - python3
                      - /opt/pytorch-mnist/mnist.py
                      - '--epochs=1'
                      - '--lr=${trialParameters.learningRate}'
                      - '--momentum=0.5'

This is the UI-generated YAML. Katib was unable to find a container named training-container, so I changed the container names to training-container and got this error:

2024-02-27T11:10:40Z	ERROR	PyTorchJob failed validation	{"pytorchjob": {"name":"random-experiment-bjl5fqvl","namespace":"default"}, "error": "PyTorchJobSpec is not valid: There is no container named pytorch in Master"}
2024-02-27T11:10:40Z	ERROR	Reconciler error	{"controller": "pytorchjob-controller", "object": {"name":"random-experiment-bjl5fqvl","namespace":"default"}, "namespace": "default", "name": "random-experiment-bjl5fqvl", "reconcileID": "2fb2991b-59fb-4d83-85ba-30c1765e7978", "error": "PyTorchJobSpec is not valid: There is no container named pytorch in Master"}

Later I changed the primary container name to pytorch, as required by the PyTorchJobSpec, and I can see the pods getting created:

[abharath@abharath-thinkpadt14sgen2i ~]$ kubectl get pods 
NAME                                                      READY   STATUS              RESTARTS      AGE
random-experiment-bayesianoptimization-5749c87757-nzg2l   1/1     Running             0             4m24s
random-experiment-kf8zd5hs-master-0                       0/2     ContainerCreating   0             4m2s
random-experiment-kf8zd5hs-worker-0                       0/1     Init:0/1            1 (37s ago)   4m2s
random-experiment-kf8zd5hs-worker-1                       0/1     Init:0/1            1 (37s ago)   4m2s
random-experiment-nfhzbkq4-master-0                       0/2     ContainerCreating   0             4m3s
random-experiment-nfhzbkq4-worker-0                       0/1     Init:0/1            1 (37s ago)   4m2s
random-experiment-nfhzbkq4-worker-1                       0/1     Init:0/1            1 (37s ago)   4m2s
random-experiment-pvwccdnr-master-0                       0/2     ContainerCreating   0             4m3s
random-experiment-pvwccdnr-worker-0                       0/1     Init:0/1            1 (41s ago)   4m3s
random-experiment-pvwccdnr-worker-1                       0/1     Init:0/1            1 (41s ago)   4m3s

Thanks @tenzen-y

Thank you for creating this @bharathappali.
That's correct, you have to name the container pytorch for PyTorchJob.
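The fix amounts to keeping the two names consistent: the PyTorchJob container must be named pytorch, and Katib's primaryContainerName must point at that same name. A minimal sketch of the relevant trialTemplate fields, trimmed from the Experiment YAML above:

```yaml
# Both names must agree: the container inside the PyTorchJob is "pytorch",
# and primaryContainerName tells Katib which container to treat as primary.
trialTemplate:
  primaryContainerName: pytorch
  trialSpec:
    apiVersion: kubeflow.org/v1
    kind: PyTorchJob
    spec:
      pytorchReplicaSpecs:
        Master:
          replicas: 1
          template:
            spec:
              containers:
                - name: pytorch   # required name for PyTorchJob containers
                  image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v0.16.0
```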

Did your Katib Trials succeed after you renamed the container name and primaryContainerName?

Yes @andreyvelich, I was able to run Katib trials after the changes.

Thank you! Closing this issue.