samunaai / tuunv2


Tuunv2

Our stack consists of four key architectural components, which need to be put together in order to run:

  1. Docker: This is used as the fundamental tool for running containers. Each Docker container is a self-contained operating environment.
  2. Kubernetes/Microkubernetes (aka K8s/microk8s): This serves as a resource provisioner, managing how computational resources are assigned to different container instances or collections of containers, called pods.
  3. Argo (specifically Argo Workflows): This is used for executing pipelines, its core advantage being inbuilt support for memoisation of experimental results.
  4. Katib: This serves as an experiment scheduler. In Katib you generally specify a range of parameters, and Katib will choose the best parameters to optimise your results - we extend its functionality by incorporating it into a pipeline tuning setting.
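Once everything is installed, each layer of the stack can be sanity-checked from the shell. A minimal sketch (the `kubeflow` namespace for Katib is an assumption based on its default install location):

```shell
# Check each layer of the stack is installed and reachable
docker --version          # container runtime
microk8s kubectl version  # Kubernetes API (or plain `kubectl version`)
argo version              # Argo Workflows CLI

# Katib runs inside the cluster; check its pods instead
microk8s kubectl get pods -n kubeflow
```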

The diagram below shows a flow chart explaining our stack:

    [diagram: stack overview]

1. Docker

Docker can be installed using the official instructions.

  • nvidia-docker2 is also required. It can be installed using the official NVIDIA instructions

  • Your goal should be to run an official nvidia docker container with a simple nvidia-smi command. For example, sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi should display the output of nvidia-smi, run from within the docker container for cuda version 11.0.3. Other versions can be found at nvidia/cuda on docker hub

  • If installed correctly, your output will look like this (driver versions may differ):

    [screenshot: nvidia-smi output from inside the container]
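If you prefer a scripted setup, the whole Docker step can be sketched roughly as follows (Ubuntu assumed; this presumes NVIDIA's apt repository has already been configured as per their official instructions):

```shell
# Install Docker via the official convenience script
curl -fsSL https://get.docker.com | sh

# Install nvidia-docker2 and restart the Docker daemon
sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker

# Smoke test: run nvidia-smi from inside a CUDA container
sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
```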

2. Kubernetes

We recommend two readily available options for Kubernetes usage:

A. Kubernetes: This is the full production-ready version of Kubernetes. It can be installed using the official documentation.

B. Microk8s: microkubernetes is a lightweight version of Kubernetes. It is not suitable for production, but is quick to set up for rapid prototyping by following the official documentation.

Most kubernetes commands can be run with microk8s - you just need to add the keyword microk8s before the command, e.g. microk8s kubectl get nodes should list the nodes in the cluster.

GPU support for kubernetes needs to be enabled separately, and likewise for microk8s. For microk8s, you need to work with precise software versions for GPU support to work - we tested using microk8s 1.22 on Ubuntu 18.04, with NVIDIA driver 470.103.01 and CUDA 11.4.
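Pinning microk8s to the tested version can be sketched with snap (the `1.22/stable` channel name is an assumption based on snap's usual channel naming):

```shell
# Install the tested microk8s version (1.22) via snap
sudo snap install microk8s --classic --channel=1.22/stable

# Allow the current user to run microk8s without sudo
sudo usermod -aG microk8s $USER

# Wait until the cluster reports ready
microk8s status --wait-ready
```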

Points to note when setting up microk8s:

  • microk8s enable dns storage: make sure to run these commands while installing microk8s as per the docs linked above. Otherwise argo will not install.

  • microk8s enable gpu: This command enables GPU support in microk8s. If run correctly, a number of pods under the namespace gpu-operator-resources will be started, and their status will be either "running" or "completed" after just a few minutes. Furthermore, when we run microk8s kubectl describe node, "nvidia.com/gpu" will be listed under the allocated resources, as shown below:

    [screenshot: kubectl describe node listing nvidia.com/gpu under allocated resources]
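The enable-and-verify sequence above can be sketched end to end like this:

```shell
# Enable the required add-ons
microk8s enable dns storage
microk8s enable gpu

# Verify: gpu-operator pods should reach Running/Completed within a few minutes
microk8s kubectl get pods -n gpu-operator-resources

# Verify: the node should advertise an nvidia.com/gpu resource
microk8s kubectl describe node | grep -A 10 "Allocated resources"
```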

Some useful starter commands in kubernetes are:

  • kubectl get pods: list all running pods
  • kubectl describe pod pod-name: shows all statistics of a particular pod, which is very useful for debugging
  • kubectl logs pod-name: shows the outputs produced from running a container
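In practice these commands are usually combined with a namespace flag. A short illustration (`my-pod` is a placeholder for a real pod name from `kubectl get pods`):

```shell
kubectl get pods -A                  # list pods across all namespaces
kubectl describe pod my-pod -n argo  # detailed state and events, useful for debugging
kubectl logs my-pod -n argo -f       # stream the container's output as it runs
```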

3. Argo

The Argo project has several tools, but we use mainly Argo Workflows. Argo Workflows quickstart page provides useful steps to get it up and running. Here are some helpful notes to provide you with additional installation advice:

  • kubectl create ns argo && kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo-workflows/master/manifests/quick-start-postgres.yaml : If the Argo server is running correctly, when we run kubectl get pods -A, there should be 4 pods running under the namespace argo, like the image below. It is okay if these pods restart a few times, but they should all be up and running within ~2-5 minutes. Otherwise there may be an error.

    [screenshot: argo namespace pods]
  • kubectl -n argo port-forward deployment/argo-server 2746:2746 : this will serve the Argo dashboard on port 2746. If you are running the code on a remote server, make sure that you are port forwarding from the server to your computer, i.e. by adding -L 2746:localhost:2746 to your ssh command. The Argo dashboard looks like this:

    [screenshot: Argo dashboard]
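The remote-server case can be sketched as follows (`user@server` is a placeholder for your own connection details):

```shell
# On the remote server: expose the Argo dashboard
kubectl -n argo port-forward deployment/argo-server 2746:2746

# On your local machine: tunnel the port over ssh
ssh -L 2746:localhost:2746 user@server

# Then browse to https://localhost:2746
```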
  • argo version: When you download a version of argo from the releases, the output of this command should look like this:

    [screenshot: argo version output]
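Downloading the CLI from the releases page can be sketched like this (the version number is illustrative; pick one matching your server from the Argo Workflows releases):

```shell
# Download and install the argo CLI binary (Linux, amd64)
curl -sLO https://github.com/argoproj/argo-workflows/releases/download/v3.4.4/argo-linux-amd64.gz
gunzip argo-linux-amd64.gz
chmod +x argo-linux-amd64
sudo mv argo-linux-amd64 /usr/local/bin/argo

argo version
```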
  • Environment variables: Setting environment variables such as ARGO_INSECURE_SKIP_VERIFY is required for proper permissions in workflow submission. argo --help provides more information on these variables, as can be seen in the User tab to the left of the Argo dashboard (localhost:2746/userinfo). Several environment variables can be set in one shot by making adjustments to your bashrc file like below:

    # recommended in User panel
    cat >> ~/.bashrc <<EOL
    export ARGO_SERVER='127.0.0.1:2746'
    export ARGO_HTTP1=true
    export ARGO_SECURE=true
    export ARGO_BASE_HREF=
    export ARGO_TOKEN=''
    export ARGO_NAMESPACE=argo
    export ARGO_INSECURE_SKIP_VERIFY=true
    EOL

    #check it works
    argo list

  • hello world: argo submit -n argo --watch https://raw.githubusercontent.com/argoproj/argo-workflows/master/examples/hello-world.yaml If Argo Workflows is correctly installed, you should be able to submit the "hello world" workflow using the above command. If you click on the corresponding workflow which shows up in the Argo dashboard, you should see a picture of a whale! This is the Docker version of the famous "cowsay" program in Linux.

    [screenshots: hello-world workflow in the dashboard, and the whale output]
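After submission, the workflow can also be inspected from the CLI:

```shell
# Submit the example workflow and watch it run
argo submit -n argo --watch \
  https://raw.githubusercontent.com/argoproj/argo-workflows/master/examples/hello-world.yaml

# Inspect afterwards
argo list -n argo          # the workflow should show Succeeded
argo logs -n argo @latest  # prints the whale/cowsay output
```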

3.1 Regarding Volumes and Memory Usage

  • In order to have allocated disk space for Argo to write artifacts to disk, you may need to create a persistent volume (pv), and then a persistent volume claim (pvc), which you can pass to Argo.
  • Sample yaml's for declaring a pv can be found in the volumes folder of this repository. For example, you could run kubectl apply -f argo-pv.yaml
  • Note that a pvc is namespaced: Argo will not be able to use storage under a persistent volume claim unless the claim belongs to the same namespace as the Argo server and pods. (Persistent volumes themselves are cluster-scoped; a pvc binds to any pv whose capacity and access modes match its request.)
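A minimal pv/pvc pair can be sketched inline with a heredoc. The names, hostPath, and sizes below are placeholders for illustration only - the actual manifests live in the volumes folder of this repository (e.g. argo-pv.yaml):

```shell
# Illustrative pv/pvc pair; note the pvc sits in the argo namespace
# so that workflow pods can mount it.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: argo-pv
spec:
  capacity:
    storage: 10Gi
  accessModes: ["ReadWriteOnce"]
  hostPath:
    path: /mnt/argo-artifacts
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: argo-pvc
  namespace: argo
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
EOF

kubectl get pvc -n argo   # status should become Bound
```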
