This repository covers an approach to run Kubernetes workloads in a Slurm cluster. The approach uses Kind (Kubernetes in Docker) to set up temporary Kubernetes clusters. Kind supports rootless Podman, which makes it a suitable choice for HPC use cases. This project is part of my master’s thesis at the Georg August University of Göttingen. The goal of the thesis is to investigate approaches to run Kubernetes workloads in a Slurm cluster.
Limitation: In its current state, this project does not support running a single distributed workload across multiple Slurm nodes. So far, a node of a temporary Kubernetes cluster cannot communicate with a node running on a different Slurm node.
First, the Slurm cluster has to be up and running. In addition, shared storage (e.g. NFS) has to be available on all cluster nodes. This project targets RHEL 9 on x86, but may work on other RHEL distributions as well. Apart from that, the following software has to be installed on all nodes:
- Bash
- Podman
- slirp4netns
- Kind
- Kubectl
- shadow-utils
In addition, all nodes must be configured as follows:
- cgroups v2 is enabled
- CPU delegation is enabled
- Kernel modules ip6_tables, ip6table_nat, ip_tables, and iptable_nat are loaded
The initial setup instructions to ensure the prerequisites can be found in Setup.md.
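As a quick sanity check of two of the configuration requirements above, the following sketch may help. The function names are hypothetical and not part of this repository. It assumes that cgroups v2 shows up as a cgroup.controllers file in the cgroup root and that loaded modules are listed in /proc/modules; modules compiled into the kernel do not appear there, and CPU delegation is not checked.

```bash
#!/bin/bash
# Hypothetical pre-flight helpers; not part of this repository.

# Succeeds if the unified cgroup v2 hierarchy is mounted at the given root
# (default: /sys/fs/cgroup).
cgroup_v2_enabled() {
  local cgroup_root="${1:-/sys/fs/cgroup}"
  [ -f "$cgroup_root/cgroup.controllers" ]
}

# Prints every requested kernel module that is not listed in the given
# modules file (normally /proc/modules). Built-in modules are not listed there.
missing_modules() {
  local modules_file="$1"
  shift
  local mod
  for mod in "$@"; do
    # Each line of the modules file starts with the module name and a space.
    grep -q "^${mod} " "$modules_file" || echo "$mod"
  done
}
```

For example, calling missing_modules /proc/modules ip6_tables ip6table_nat ip_tables iptable_nat prints nothing when all four modules are loaded.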
- Clone this repository into a shared directory that is present on all nodes
- cd into the directory
- As an example, run:
srun -N1 /bin/bash run-workload.sh $PWD/example-workloads/workload-pod-sysbench/workload-pod-sysbench.sh
The script run-workload.sh lets users execute user-defined Kubernetes workloads as jobs on a Slurm cluster. To do so, users write a custom Linux shell script that creates workloads using kubectl. The script run-workload.sh sets up a temporary Kubernetes cluster inside a container using Kind, then executes the Kubernetes workload (the user-defined workload script), and finally deletes the cluster when the workload is finished. It supports multi-tenant usage, so multiple users can create multiple clusters and use them separately. A single user can also create multiple Slurm jobs, leading to multiple clusters running in parallel on the same node.
To enable access to files on the host machine inside a Kubernetes workload, the current working directory of the host machine is shared with the Kubernetes cluster container.
Inside the container, it is available at /app. In a Kubernetes workload, this directory can be included using a volume.
The script workload-job-pytorch.sh shows how the shared directory may be used.
As mentioned before, users can write scripts that describe the workload. Inside the script, kubectl is available.
How can the right clusters be selected in case of multiple Slurm jobs?
When the Kubernetes cluster is created, a random name is picked for it.
This name is available in the workload script through the variable K8S_CLUSTER_NAME and can be passed to kubectl to reference the correct cluster, e.g. kubectl get jobs --context "$K8S_CLUSTER_NAME".
To create Kubernetes resources, use kubectl create --context "$K8S_CLUSTER_NAME" followed by the resource, just as in a regular Kubernetes cluster.
Another important part of a workload script is waiting for the workloads to complete (e.g. by using kubectl wait --context "$K8S_CLUSTER_NAME").
Otherwise, the cluster is deleted before the workload has finished.
Generally, it is good practice to delete the resources as a last step.
However, this is not strictly necessary because the whole Kubernetes cluster is deleted in the end.
In workload scripts, the Kubernetes cluster can also be accessed through the Kubernetes REST API. For this use case, two environment variables are provided: $K8S_CLUSTER_API and $K8S_CLUSTER_API_TOKEN. The file workload-kube-api.sh provides an example. The token grants access to the service account admin-user, which is bound to the role cluster-admin.
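A call to the REST API from a workload script could look roughly like the sketch below. The helper name k8s_api_get is hypothetical, and this is not the content of workload-kube-api.sh. It assumes that K8S_CLUSTER_API has no trailing slash and that the API server certificate is not trusted by curl by default, which is why TLS verification is skipped.

```bash
#!/bin/bash
# Hypothetical helper; the name k8s_api_get is not part of this repository.
# Sends an authenticated GET request to the cluster's REST API and prints
# the response body.
k8s_api_get() {
  local api_path="$1"
  # --insecure skips TLS verification. Assumption: the API server certificate
  # is signed by a cluster-internal CA that curl does not know.
  curl --silent --show-error --fail --insecure \
    --header "Authorization: Bearer ${K8S_CLUSTER_API_TOKEN}" \
    "${K8S_CLUSTER_API}${api_path}"
}
```

With this helper, k8s_api_get /api/v1/namespaces would list the namespaces of the cluster as JSON.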
Overall, the following variables are available inside workload scripts:
Variable Name | Description |
---|---|
K8S_CLUSTER_NAME | Name of the Kind cluster. Can be used in kubectl --context |
K8S_CLUSTER_API | URL of the Kubernetes API |
K8S_CLUSTER_API_TOKEN | Token for the Kubernetes API |
K8S_PORT | Port that is shared with the host machine. If it is not set beforehand, it is selected at runtime from the range 30000 to 32767. It can be used e.g. in a Kubernetes service; workload-pod-nginx.sh gives an example. |
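
To illustrate how K8S_PORT might be used in a service, the following sketch prints a NodePort Service manifest. It is not taken from workload-pod-nginx.sh. The helper name, the label app, and container port 80 are assumptions made for illustration; the use of a NodePort is inferred from the port range in the table above.

```bash
#!/bin/bash
# Hypothetical helper; prints a NodePort Service manifest for pods that carry
# the label app=<name>. Label and container port 80 are assumptions.
render_nodeport_service() {
  local app_name="$1"
  # Use the second argument if given, otherwise fall back to K8S_PORT.
  local node_port="${2:-$K8S_PORT}"
  cat <<EOF
apiVersion: v1
kind: Service
metadata:
  name: ${app_name}
spec:
  type: NodePort
  selector:
    app: ${app_name}
  ports:
  - port: 80
    targetPort: 80
    nodePort: ${node_port}
EOF
}
```

Inside a workload script, the manifest could then be applied with render_nodeport_service nginx | kubectl create --context "$K8S_CLUSTER_NAME" -f -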
The following workload script is a minimal example:
# Create workloads
kubectl create --context "$K8S_CLUSTER_NAME" namespace example
kubectl create --context "$K8S_CLUSTER_NAME" -n example -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: hello
spec:
  template:
    spec:
      securityContext:
        runAsUser: 0
      containers:
      - name: hello
        image: alpine
        command: ['echo', 'hello world']
        volumeMounts:
        - name: project-vol
          mountPath: /app
      restartPolicy: OnFailure
      volumes:
      - name: project-vol
        hostPath:
          path: /app
          type: Directory
EOF
# Wait for workloads to finish
kubectl wait --context "$K8S_CLUSTER_NAME" -n example --for=condition=complete --timeout=10h job/hello
# Print workload logs
kubectl logs --context "$K8S_CLUSTER_NAME" -n example job/hello
# Delete workloads
kubectl delete --context "$K8S_CLUSTER_NAME" namespace example
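Note that kubectl wait --for=condition=complete, as used in the example above, only returns early when the job succeeds; if the job fails, it blocks until the timeout expires. One possible way to react to failures sooner is to poll the job conditions, as in the following sketch. Both function names are hypothetical, and the loop has no timeout of its own.

```bash
#!/bin/bash
# Hypothetical helper; prints "complete", "failed", or "running" for a job.
job_state() {
  local namespace="$1" job_name="$2" conditions
  # List the types of all job conditions whose status is "True".
  conditions=$(kubectl get --context "$K8S_CLUSTER_NAME" -n "$namespace" "job/$job_name" \
    -o jsonpath='{range .status.conditions[?(@.status=="True")]}{.type}{"\n"}{end}')
  case "$conditions" in
    *Complete*) echo complete ;;
    *Failed*) echo failed ;;
    *) echo running ;;
  esac
}

# Polls until the job is complete (returns 0) or failed (returns 1).
# The third argument is the polling interval in seconds (default: 5).
wait_for_job() {
  local namespace="$1" job_name="$2" interval="${3:-5}" state
  while true; do
    state=$(job_state "$namespace" "$job_name")
    case "$state" in
      complete) return 0 ;;
      failed) return 1 ;;
    esac
    sleep "$interval"
  done
}
```

In the example above, wait_for_job example hello could take the place of the kubectl wait line.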
Further examples of workload scripts are included in the directory example-workloads:
- workload-pod-sysbench.sh: Runs a CPU benchmark. It also shows how pods can be used, although the benchmark could also be implemented as a job.
- workload-job-pytorch.sh: Runs a PyTorch training and stores the resulting model on the node in the directory ./kubernetes-pytorch/out/
- workload-yaml.sh: Runs a hello-world job defined in a yaml file
- workload-kube-api.sh: Queries the Kubernetes REST API using curl
- workload-pod-nginx.sh: Runs an nginx web server. This serves as an example of how a service running on one node can be accessed from another node.
In general, the script can run without root privileges. Also, the path to your Kubernetes workload script has to be passed as an argument. Here, the script workload-pod-sysbench.sh is used as an example. Run the following command from the project root directory to use Slurm to execute the workload:
srun -N1 /bin/bash run-workload.sh $PWD/example-workloads/workload-pod-sysbench/workload-pod-sysbench.sh
To utilize the full compute power of a machine, additional Slurm arguments may be needed. The following arguments allow the job to use 56 CPU cores:
srun -N1 -c56
One can also use sbatch to run KSI. The following batch script batch-ksi.sh serves as an example:
#!/bin/bash
# batch-ksi.sh
#SBATCH --nodes=1
srun -N1 /bin/bash run-workload.sh $PWD/example-workloads/workload-pod-sysbench/workload-pod-sysbench.sh
Run the following command from the project root directory:
sbatch -D $PWD batch-ksi.sh
In fact, the script can also operate without Slurm:
/bin/bash run-workload.sh $PWD/example-workloads/workload-pod-sysbench/workload-pod-sysbench.sh
To store stdout and stderr in a file, you can append the following tee command:
/bin/bash run-workload.sh $PWD/example-workloads/workload-pod-sysbench/workload-pod-sysbench.sh |& tee log.txt
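When output is piped through tee as shown above, the exit status of the pipeline is that of tee unless pipefail is set, so a failed workload may go unnoticed. The following sketch shows one way to keep the original exit status in Bash. The function name run_and_log is hypothetical and not part of this repository.

```bash
#!/bin/bash
# Hypothetical helper; runs a command, copies its stdout and stderr to a log
# file, and returns the command's own exit status instead of tee's.
run_and_log() {
  local log_file="$1"
  shift
  "$@" 2>&1 | tee "$log_file"
  # PIPESTATUS[0] holds the exit status of the first command in the pipeline.
  return "${PIPESTATUS[0]}"
}
```

For example: run_and_log log.txt /bin/bash run-workload.sh "$PWD/example-workloads/workload-pod-sysbench/workload-pod-sysbench.sh"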
To set up an interactive Kubernetes cluster in a Slurm job run:
TODO
Ideas:
- Slurm job that creates a cluster (fire and forget) that can be used from login node. May need to implement some function to delete the cluster on job cancellation.
- Interactive Slurm job
To list all Kubernetes clusters run:
KIND_EXPERIMENTAL_PROVIDER=podman kind get clusters
To list all Kubernetes nodes run:
KIND_EXPERIMENTAL_PROVIDER=podman kind get nodes
Alternatively, you can gain insight on your existing Kubernetes clusters by listing all Podman containers:
podman ps -a
In case a Slurm job fails, you might encounter a still running Kubernetes cluster. To delete this cluster you need to find out the name first. Then you can run:
KIND_EXPERIMENTAL_PROVIDER=podman kind delete cluster --name "cluster_name"
Alternatively, on some distributions, you might need to use systemd-run to start kind in its own cgroup scope:
KIND_EXPERIMENTAL_PROVIDER=podman systemd-run --scope --user kind delete cluster --name "cluster_name"
Inside a Kubernetes pod or job, a permission denied error may occur. This usually means that the user has no permission to access a file or directory.
A cause for this may be the directory mapping in the kind config kind-config-template.yaml or the (un)set user in the pod or job.
Some container images set up a non-root user that executes the application inside the container. This can lead to the error mentioned above. To solve this, explicitly set the user in the Kubernetes pod to root by adding:
spec:
  # ...
  securityContext:
    runAsUser: 0
  # ...
To debug this you may run:
kubectl create -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: alpine
spec:
  securityContext:
    runAsUser: 0
  containers:
  - name: alpine
    image: alpine
    command: ['ls', '-aln', '/app']
    volumeMounts:
    - name: project-vol
      mountPath: /app
  restartPolicy: OnFailure
  volumes:
  - name: project-vol
    hostPath:
      path: /app
      type: Directory
EOF
kubectl logs pod/alpine
ERROR: failed to create cluster: could not find a log line that matches "Reached target .*Multi-User System.*|detected cgroup v1"
This error seems to occur when the machine does not have sufficient resources left. Each machine can only handle a certain number of Kind clusters.
Fix: Try to run the workload on another cluster or delete other clusters first.