Mounting PVC to Katib PyTorchJob
maxencedraguet opened this issue
/kind bug
Hello! I am trying to run a Katib PyTorchJob to train my model with data stored on a PVC in ReadWriteMany mode (created by a Notebook on Kubeflow). I can run a mock training in a Kubeflow Notebook and have also managed to run the code in a Kubernetes pipeline that accesses the data in the PVC, so I am confident the code itself is fine.
What steps did you take and what happened:
The YAML config I use for the Katib PyTorchJob is below (submitted through the Kubeflow dashboard). Some work is done by both the master and the worker, so I need to mount the PVC on both replicas:
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
namespace: mynamespace
name: pytorchjob-katib-mount
spec:
parallelTrialCount: 2
maxTrialCount: 4
maxFailedTrialCount: 1
objective:
type: minimize
goal: 1.000
objectiveMetricName: val_jet_classification_loss
metricsCollectorSpec:
collector:
kind: StdOut
algorithm:
algorithmName: random
parameters:
- name: batch_size
parameterType: int
feasibleSpace:
min: "10"
max: "100"
trialTemplate:
retain: true
primaryContainerName: pytorch
trialParameters:
- name: batch_size
description: Batch size for the training model
reference: batch_size
trialSpec:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
spec:
pytorchReplicaSpecs:
Master:
replicas: 1
restartPolicy: Never
template:
metadata:
labels:
mount-kerberos-secret: "true"
mount-eos: "true"
mount-nvidia-driver: "false"
annotations:
sidecar.istio.io/inject: "false"
spec:
containers:
- name: pytorch
resources:
limits:
nvidia.com/gpu: 0
image: my_image
command:
- "python"
- "execute.py"
- "fit"
- "--config=config.yaml"
- "--data.batch_size=${trialParameters.batch_size}"
volumeMounts:
- mountPath: "/mypath/"
name: data_volume
volumes:
- name: data_volume
persistentVolumeClaim:
claimName: myclaimname
Worker:
replicas: 1
restartPolicy: OnFailure # in case master starts before
template:
metadata:
labels:
mount-kerberos-secret: "true"
mount-eos: "true"
mount-nvidia-driver: "false"
annotations:
sidecar.istio.io/inject: "false"
spec:
containers:
- name: pytorch
resources:
limits:
nvidia.com/gpu: 0
image: my_image
command:
- "python"
- "execute.py"
- "fit"
- "--config= config.yaml"
- "--data.batch_size=${trialParameters.batch_size}"
volumeMounts:
- mountPath: "/mypath/"
name: data_volume
volumes:
- name: data_volume
persistentVolumeClaim:
claimName: myclaimname
What did you expect to happen:
I was expecting the Katib experiment to run and correctly read the data from the mounted PVC. In particular, it should create pods for the master and the worker (which it does when I do not mount the PVC). Instead, no pods are created for either the master or the worker; the only pod is that of the Katib suggestion service itself. Its log shows that 2 trials were returned, and nothing more:
>kubectl logs pytorchjob-katib-mount-random-6dfdfb446-7nkkn
INFO:pkg.suggestion.v1beta1.hyperopt.base_service:GetSuggestions returns 2 new Trial
Checking the experiment, I can indeed see these two trials listed as running, but with no further details:
> kubectl get experiment pytorchjob-katib-mount -o yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
creationTimestamp: "2023-10-09T15:25:44Z"
finalizers:
- update-prometheus-metrics
generation: 1
name: pytorchjob-katib-mount
namespace: mynamespace
resourceVersion: "1032001899"
selfLink: /apis/kubeflow.org/v1beta1/namespaces/mynamespace/experiments/pytorchjob-katib-mount
uid: bb3fa402-dda5-4c30-9dfb-11a67ee3d39a
spec:
algorithm:
algorithmName: random
maxFailedTrialCount: 1
maxTrialCount: 4
metricsCollectorSpec:
collector:
kind: StdOut
objective:
goal: 1
metricStrategies:
- name: val_jet_classification_loss
value: min
objectiveMetricName: val_jet_classification_loss
type: minimize
parallelTrialCount: 2
parameters:
- feasibleSpace:
max: "100"
min: "10"
name: batch_size
parameterType: int
resumePolicy: LongRunning
trialTemplate:
failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
primaryContainerName: pytorch
primaryPodLabels:
job-role: master
retain: true
successCondition: status.conditions.#(type=="Succeeded")#|#(status=="True")#
trialParameters:
- description: Batch size for the training model
name: batch_size
reference: batch_size
trialSpec:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
spec:
pytorchReplicaSpecs:
Master:
replicas: 1
restartPolicy: Never
template:
metadata:
annotations:
sidecar.istio.io/inject: "false"
labels:
mount-eos: "true"
mount-kerberos-secret: "true"
mount-nvidia-driver: "false"
spec:
containers:
- command:
- python
- execute.py
- fit
- --config= config.yaml
- --data.batch_size=${trialParameters.batch_size}
image: my_image
name: pytorch
resources:
limits:
nvidia.com/gpu: 0
volumeMounts:
- mountPath: /mypath/
name: data_volume
volumes:
- name: data_volume
persistentVolumeClaim:
claimName: myclaimname
Worker:
replicas: 1
restartPolicy: OnFailure
template:
metadata:
annotations:
sidecar.istio.io/inject: "false"
labels:
mount-eos: "true"
mount-kerberos-secret: "true"
mount-nvidia-driver: "false"
spec:
containers:
- command:
- python
- execute.py
- fit
- --config=config.yaml
- --data.batch_size=${trialParameters.batch_size}
image: my_image
name: pytorch
resources:
limits:
nvidia.com/gpu: 0
volumeMounts:
- mountPath: /mypath/
name: data_volume
volumes:
- name: data_volume
persistentVolumeClaim:
claimName: myclaimname
status:
conditions:
- lastTransitionTime: "2023-10-09T15:25:44Z"
lastUpdateTime: "2023-10-09T15:25:44Z"
message: Experiment is created
reason: ExperimentCreated
status: "True"
type: Created
- lastTransitionTime: "2023-10-09T15:25:56Z"
lastUpdateTime: "2023-10-09T15:25:56Z"
message: Experiment is running
reason: ExperimentRunning
status: "True"
type: Running
currentOptimalTrial:
bestTrialName: ""
observation:
metrics: null
parameterAssignments: null
runningTrialList:
- pytorchjob-katib-mount-kssk887m
- pytorchjob-katib-mount-dr5z94gn
startTime: "2023-10-09T15:25:44Z"
trials: 2
trialsRunning: 2
Furthermore, I am able to describe the PVC from a terminal, and I can see it is correctly accessed by the pipeline test and by the notebook that created it, but not by the Katib experiment (none of the trials are listed under Used By):
> kubectl describe pvc myclaimname
...
Used By: ...
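For anyone debugging something similar, the trial and PyTorchJob objects behind the experiment can also be inspected directly. This is only a sketch, assuming a standard Kubeflow install where the Katib controller runs as the katib-controller deployment in the kubeflow namespace (names may differ per cluster); the controller log is typically where a failed trial-job creation would show up:
> kubectl describe trial pytorchjob-katib-mount-kssk887m -n mynamespace
> kubectl get pytorchjobs -n mynamespace
> kubectl logs deployment/katib-controller -n kubeflow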
Anything else you would like to add:
If I submit the same Katib YAML without mounting the PVC on the master, pods for the master do appear (they fail because the data is not accessible), but still no worker pods are created. I also tried a simple example I found for Katib that uses a Job instead of a PyTorchJob (YAML below); there the trials are created but stay stuck in the Created state and never start running. If I remove the (unused) PVC mount, the trials do move to Running.
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
namespace: mynamespace
name: random-mount
spec:
objective:
type: maximize
goal: 0.99
objectiveMetricName: Validation-accuracy
additionalMetricNames:
- Train-accuracy
algorithm:
algorithmName: random
parallelTrialCount: 3
maxTrialCount: 12
maxFailedTrialCount: 3
parameters:
- name: lr
parameterType: double
feasibleSpace:
min: "0.01"
max: "0.03"
- name: num-layers
parameterType: int
feasibleSpace:
min: "2"
max: "5"
- name: optimizer
parameterType: categorical
feasibleSpace:
list:
- sgd
- adam
- ftrl
trialTemplate:
primaryContainerName: training-container
trialParameters:
- name: learningRate
description: Learning rate for the training model
reference: lr
- name: numberLayers
description: Number of training model layers
reference: num-layers
- name: optimizer
description: Training model optimizer (sdg, adam or ftrl)
reference: optimizer
trialSpec:
apiVersion: batch/v1
kind: Job
spec:
template:
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/mxnet-mnist:latest
command:
- "python3"
- "/opt/mxnet-mnist/mnist.py"
- "--batch-size=64"
- "--lr=${trialParameters.learningRate}"
- "--num-layers=${trialParameters.numberLayers}"
- "--optimizer=${trialParameters.optimizer}"
resources:
limits:
memory: "2Gi"
cpu: "0.5"
volumeMounts:
- mountPath: /mypath/
name: data_vol
restartPolicy: Never
volumes:
- name: data_vol
persistentVolumeClaim:
claimName: myclaimname
Environment:
> kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:18:45Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:12:29Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
- OS (uname -a): Linux 5.12.7-300.fc34.x86_64 #1 SMP Wed May 26 12:58:58 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
- Katib version (check the Katib controller image version): TBC
Something I think could be important: I use S3 from the Python code running on the workers. The keys are passed via the configuration YAML used by Python. This caused no issue in the pipeline example I mentioned (I can see the logs created by the pipeline run in my S3 bucket).
Hey all! So it's quite silly: this is not valid because of the "_" in the volume name.
volumes:
- name: data_vol
Somehow no errors are returned when running with a config like this; the pods just don't instantiate, which makes debugging impossible without prior knowledge of this rule.
Hello! So this is quite silly: one should not use "_" in volumes.name (I had put data_vol). Somehow no error is returned for this; the pods where the volume is mounted simply never initialise. It could be useful to surface an error here.
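For anyone landing on this later: Kubernetes volume names must be valid DNS-1123 labels (lowercase alphanumerics and "-"), so an underscore makes the generated pod spec invalid. A minimal sketch of the fix, keeping the rest of the trialSpec as posted and only renaming the volume (data-volume here is just one valid choice, claimName stays a placeholder):
volumes:
- name: data-volume
  persistentVolumeClaim:
    claimName: myclaimname
with the matching mount in the container spec:
volumeMounts:
- mountPath: /mypath/
  name: data-volume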