kubeflow / katib

Automated Machine Learning on Kubernetes

Home Page: https://www.kubeflow.org/docs/components/katib

Mounting PVC to Katib PyTorchJob

maxencedraguet opened this issue · comments

/kind bug

Hello! I am trying to run a Katib PyTorchJob to train my model with data stored on a PVC in ReadWriteMany mode (created by a Notebook on Kubeflow). I can run a mock training in a Kubeflow Notebook and have also managed to run the code in a Kubernetes pipeline that accesses the data on the PVC, so I am confident the code itself is fine.
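
For reference, this is roughly how I check that the claim really is ReadWriteMany (myclaimname and mynamespace are the placeholder names I use throughout this issue):

kubectl get pvc myclaimname -n mynamespace -o jsonpath='{.spec.accessModes}'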

What steps did you take and what happened:
The YAML config I use for the Katib PyTorchJob is as follows (submitted through the Kubeflow dashboard). Some work is done by both the master and the worker, so I need to mount the PVC on both:

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  namespace: mynamespace
  name: pytorchjob-katib-mount
spec:
  parallelTrialCount: 2
  maxTrialCount: 4
  maxFailedTrialCount: 1
  objective:
    type: minimize
    goal: 1.000
    objectiveMetricName: val_jet_classification_loss
  metricsCollectorSpec:
    collector:
      kind: StdOut
  algorithm:
    algorithmName: random
  parameters:
    - name: batch_size
      parameterType: int
      feasibleSpace:
        min: "10"
        max: "100"
  trialTemplate:
    retain: true
    primaryContainerName: pytorch
    trialParameters:
      - name: batch_size
        description: Batch size for the training model
        reference: batch_size
    trialSpec:
      apiVersion: kubeflow.org/v1
      kind: PyTorchJob
      spec:
        pytorchReplicaSpecs:
          Master:
            replicas: 1
            restartPolicy: Never
            template:
              metadata:
                labels:
                  mount-kerberos-secret: "true"
                  mount-eos: "true"
                  mount-nvidia-driver: "false"
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                containers:
                  - name: pytorch
                    resources: 
                      limits:
                        nvidia.com/gpu: 0
                    image: my_image
                    command:
                      - "python" 
                      - "execute.py"
                      - "fit"
                      - "--config=config.yaml"
                      - "--data.batch_size=${trialParameters.batch_size}"
                    volumeMounts:
                      - mountPath: "/mypath/"
                        name: data_volume
                volumes:
                  - name: data_volume
                    persistentVolumeClaim:
                      claimName: myclaimname
              
          Worker:
            replicas: 1
            restartPolicy: OnFailure # in case master starts before
            template:
              metadata:
                labels:
                  mount-kerberos-secret: "true"
                  mount-eos: "true"
                  mount-nvidia-driver: "false"
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                containers:
                  - name: pytorch
                    resources: 
                      limits:
                        nvidia.com/gpu: 0
                    image: my_image
                    command:
                      - "python" 
                      - "execute.py"
                      - "fit"
                      - "--config= config.yaml"
                      - "--data.batch_size=${trialParameters.batch_size}"
                    volumeMounts:
                      - mountPath: "/mypath/"
                        name: data_volume
                volumes:
                  - name: data_volume
                    persistentVolumeClaim:
                      claimName: myclaimname

What did you expect to happen:
I was expecting the Katib experiment to run and correctly read the data from the mounted PVC. In particular, it should create pods for the master and the worker (which it does when I do not mount the PVC). Instead, no pods are created for either the master or the worker; the only pod present is the Katib suggestion pod for the experiment. Its log shows that 2 trials were returned and nothing more:

> kubectl logs pytorchjob-katib-mount-random-6dfdfb446-7nkkn
INFO:pkg.suggestion.v1beta1.hyperopt.base_service:GetSuggestions returns 2 new Trial

Checking the experiment, I can indeed see that these two trials are running but with no more details:

> kubectl get experiment pytorchjob-katib-mount -o yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  creationTimestamp: "2023-10-09T15:25:44Z"
  finalizers:
  - update-prometheus-metrics
  generation: 1
  name: pytorchjob-katib-mount
  namespace: mynamespace
  resourceVersion: "1032001899"
  selfLink: /apis/kubeflow.org/v1beta1/namespaces/mynamespace/experiments/pytorchjob-katib-mount
  uid: bb3fa402-dda5-4c30-9dfb-11a67ee3d39a
spec:
  algorithm:
    algorithmName: random
  maxFailedTrialCount: 1
  maxTrialCount: 4
  metricsCollectorSpec:
    collector:
      kind: StdOut
  objective:
    goal: 1
    metricStrategies:
    - name: val_jet_classification_loss
      value: min
    objectiveMetricName: val_jet_classification_loss
    type: minimize
  parallelTrialCount: 2
  parameters:
  - feasibleSpace:
      max: "100"
      min: "10"
    name: batch_size
    parameterType: int
  resumePolicy: LongRunning
  trialTemplate:
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
    primaryContainerName: pytorch
    primaryPodLabels:
      job-role: master
    retain: true
    successCondition: status.conditions.#(type=="Succeeded")#|#(status=="True")#
    trialParameters:
    - description: Batch size for the training model
      name: batch_size
      reference: batch_size
    trialSpec:
      apiVersion: kubeflow.org/v1
      kind: PyTorchJob
      spec:
        pytorchReplicaSpecs:
          Master:
            replicas: 1
            restartPolicy: Never
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
                labels:
                  mount-eos: "true"
                  mount-kerberos-secret: "true"
                  mount-nvidia-driver: "false"
              spec:
                containers:
                - command:
                  - python
                  - execute.py
                  - fit
                  - --config=config.yaml
                  - --data.batch_size=${trialParameters.batch_size}
                  image: my_image
                  name: pytorch
                  resources:
                    limits:
                      nvidia.com/gpu: 0
                  volumeMounts:
                  - mountPath: /mypath/
                    name: data_volume
                volumes:
                - name: data_volume
                  persistentVolumeClaim:
                    claimName: myclaimname
          Worker:
            replicas: 1
            restartPolicy: OnFailure
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
                labels:
                  mount-eos: "true"
                  mount-kerberos-secret: "true"
                  mount-nvidia-driver: "false"
              spec:
                containers:
                - command:
                  - python
                  - execute.py
                  - fit
                  - --config=config.yaml
                  - --data.batch_size=${trialParameters.batch_size}
                  image: my_image
                  name: pytorch
                  resources:
                    limits:
                      nvidia.com/gpu: 0
                  volumeMounts:
                  - mountPath: /mypath/
                    name: data_volume
                volumes:
                - name: data_volume
                  persistentVolumeClaim:
                    claimName: myclaimname
status:
  conditions:
  - lastTransitionTime: "2023-10-09T15:25:44Z"
    lastUpdateTime: "2023-10-09T15:25:44Z"
    message: Experiment is created
    reason: ExperimentCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2023-10-09T15:25:56Z"
    lastUpdateTime: "2023-10-09T15:25:56Z"
    message: Experiment is running
    reason: ExperimentRunning
    status: "True"
    type: Running
  currentOptimalTrial:
    bestTrialName: ""
    observation:
      metrics: null
    parameterAssignments: null
  runningTrialList:
  - pytorchjob-katib-mount-kssk887m
  - pytorchjob-katib-mount-dr5z94gn
  startTime: "2023-10-09T15:25:44Z"
  trials: 2
  trialsRunning: 2
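
To dig further, here are a few commands that might surface more detail on why no pods are created (a rough sketch; the trial name is one of the two listed above):

kubectl get trials -n mynamespace
kubectl describe trial pytorchjob-katib-mount-kssk887m -n mynamespace
kubectl get pytorchjobs -n mynamespace
kubectl get events -n mynamespace --sort-by=.lastTimestamp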

Furthermore, I can inspect the PVC from a terminal, and I see it is correctly used by the pipeline test and by the notebook that created it, but not by the Katib experiment (none of the trials are listed under "Used By"):

> kubectl describe pvc myclaimname
...
Used By: ...

Anything else you would like to add:
If I submit the same Katib YAML without the mount on the master, master pods do appear (and then fail because the data is not accessible), but there are still no worker pods. I also tried a simple Katib example that uses a Job instead of a PyTorchJob (YAML below) with the same PVC mount added: there the trials are created but stay stuck in the Created state and never reach Running. If I remove the (in that case unused) PVC mount, the trials do go to Running.

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  namespace: mynamespace
  name: random-mount
spec:
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: Validation-accuracy
    additionalMetricNames:
      - Train-accuracy
  algorithm:
    algorithmName: random
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.03"
    - name: num-layers
      parameterType: int
      feasibleSpace:
        min: "2"
        max: "5"
    - name: optimizer
      parameterType: categorical
      feasibleSpace:
        list:
          - sgd
          - adam
          - ftrl
  trialTemplate:
    primaryContainerName: training-container
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: lr
      - name: numberLayers
        description: Number of training model layers
        reference: num-layers
      - name: optimizer
        description: Training model optimizer (sgd, adam or ftrl)
        reference: optimizer
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training-container
                image: docker.io/kubeflowkatib/mxnet-mnist:latest
                command:
                  - "python3"
                  - "/opt/mxnet-mnist/mnist.py"
                  - "--batch-size=64"
                  - "--lr=${trialParameters.learningRate}"
                  - "--num-layers=${trialParameters.numberLayers}"
                  - "--optimizer=${trialParameters.optimizer}"
                resources:
                  limits:
                    memory: "2Gi"
                    cpu: "0.5"
                volumeMounts:
                  - mountPath: /mypath/
                    name: data_vol
            restartPolicy: Never
            volumes:
              - name: data_vol
                persistentVolumeClaim:
                  claimName: myclaimname

Environment:

> kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:18:45Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:12:29Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
  • OS (uname -a): Linux 5.12.7-300.fc34.x86_64 #1 SMP Wed May 26 12:58:58 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

  • Katib version (check the Katib controller image version): TBC


Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍

Something I think could be important: I use S3 from the Python code running on the workers. The keys are passed through the configuration YAML that the Python code reads. This caused no issue in the pipeline example I mentioned (on my S3 bucket, I can see the logs that the pipeline run created).

Hey all! So it's quite silly. The following is not valid because of the "_" in the volume name:

volumes:
  - name: data_vol

Somehow no errors are returned when running with a config like this; the pods just don't instantiate, which makes debugging impossible without prior knowledge of this rule.

Hello! So this is quite silly: one should not use "_" in volumes.name (I had put data_vol). Somehow no errors are returned; the pods where the volume is mounted simply never initialise. It could be useful to add an error for this.
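
For anyone hitting the same thing: as far as I understand, Kubernetes volume names must be valid DNS-1123 labels (lowercase alphanumerics and "-"), so a compliant version of the snippet would look roughly like this (data-volume is just an example name; it has to be changed in both volumes and volumeMounts):

volumeMounts:
  - mountPath: "/mypath/"
    name: data-volume   # no underscore: volume names must be DNS-1123 labels
volumes:
  - name: data-volume
    persistentVolumeClaim:
      claimName: myclaimname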