LauzHack LLM Training: Getting started with the EPFL Clusters

Welcome to the LLM training track at LauzHack! This repository contains the basic steps to start running your model and scripts on the EPFL GPU cluster (called RCP), so that you don't have to wade through all the documentation yourself!

We provide you with access to the RCP cluster, which has A100 (80GB) GPUs. Jobs are managed with run:ai, a scheduler on top of Kubernetes.

To get started, first go through the minimal basic setup below.

If you have a question about the cluster or the setup that is not answered here, try talking to other participants or searching online :) Also, please do not hesitate to reach out to any of your colleagues or the organizers (e.g. on Discord); we are happy to help you out.

Tip

We know the setup below may seem daunting at first. But don't worry, the guide tries to make it as simple as possible by listing all commands in order and by providing a script that does most of the work for you. You do not need to understand the commands in detail. The only requirement is a basic understanding of how to use a terminal and git.

Caution

You are sharing the cluster with other teams. Please be mindful of the resources you use. Do not forget to stop your jobs when you are not using them!

Minimal basic setup

These are the step-by-step instructions for first-time users to quickly get a job running.

Tip

After completing the setup, the TL;DR of the interaction with the cluster is:

Get a running job with one GPU that is reserved for you: python csub.py -n sandbox
Connect to a terminal inside your job: runai exec sandbox -it -- zsh
Run your code: cd /mloscratch/<your username>/llm-baselines; python src/main.py --wandb
In one go, you can also do: python csub.py -n sandbox --train --command "cd /mloscratch/<your username>/llm-baselines; python src/main.py --wandb"

Important

Make sure you are on the EPFL wifi or connected to the VPN. The cluster is otherwise not accessible.

1: Pre-setup (access, repository)

Group access: Each team needs one EPFL member who gets access to the cluster. Send this member's EPFL email address to us in the Discord LauzHack thread, and we will add them to the group runai-mlo-lauzhack (https://groups.epfl.ch/). Because of limited GPU resources, we can only add one person per team, so make sure to coordinate this within your team.

Prepare your code: While you are waiting to get access, create a fork of the llm-baselines repository. This is the fork that you will use to run your experiments and submit your solution. You can already check out some of the code if you want to get a head start. Make sure you add all of your team members to the repository.

Prepare Weights and Biases: For logging the results of your experiments, you can use Weights and Biases. Create an account if you don't already have one. You will need an API key to later log your experiments.

The following are just a bunch of commands you need to run to get started. If you do not understand them in detail, you can copy-paste them into your terminal :)

2: Setup the tools on your own machine

Important

The setup below was tested on macOS with Apple Silicon. If you are using a different system, you may need to adapt the commands. We have no experience with the setup on Windows and therefore recommend WSL (Windows Subsystem for Linux) to run the commands.

Install kubectl. The client version should match the version of the cluster (status: 15.12.2023). On macOS with Apple Silicon, run the commands below. For other systems, you will need to change the URL in the curl command (check https://kubernetes.io/docs/tasks/tools/install-kubectl/) while keeping the same version.

    # Sketch for macOS with Apple Silicon.
    # Download a specific version (here 1.26.7 for Apple Silicon macOS)
    curl -LO "https://dl.k8s.io/release/v1.26.7/bin/darwin/arm64/kubectl"
    # Give it the right permissions and move it.
    chmod +x ./kubectl
    sudo mv ./kubectl /usr/local/bin/kubectl
    sudo chown root: /usr/local/bin/kubectl
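
For other systems, only the OS and architecture parts of the download URL change. The following is a minimal sketch of how the URL could be assembled; the helper kubectl_url is hypothetical (it is not part of this repository) and simply assumes the same dl.k8s.io layout as the curl command above.

```python
import platform

# Hypothetical helper (not part of this repository): builds the kubectl
# download URL, assuming the dl.k8s.io layout used in the curl command above.
_ARCH_ALIASES = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "arm64": "arm64",
    "aarch64": "arm64",
}


def kubectl_url(version, system=None, machine=None):
    """Return the kubectl download URL for a version, OS and CPU architecture."""
    system = (system or platform.system()).lower()      # e.g. "darwin", "linux"
    machine = (machine or platform.machine()).lower()   # e.g. "arm64", "x86_64"
    if machine not in _ARCH_ALIASES:
        raise ValueError(f"unsupported architecture: {machine}")
    # The Windows binary carries an .exe suffix.
    binary = "kubectl.exe" if system == "windows" else "kubectl"
    return (
        f"https://dl.k8s.io/release/v{version}/bin/"
        f"{system}/{_ARCH_ALIASES[machine]}/{binary}"
    )


# Reproduces the URL from the macOS Apple Silicon example above.
print(kubectl_url("1.26.7", "Darwin", "arm64"))
```

Calling kubectl_url("1.26.7") without further arguments detects the current machine; pass the printed URL to curl -LO as in the commands above.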

Set up the kube config file: Create the file ~/.kube/config in your home directory and copy the contents of the file kubeconfig.yaml from this repository into it. Note that the file on your machine has no suffix.
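
If you prefer to script this step, here is a minimal sketch. The function install_kubeconfig is a hypothetical helper, not something shipped in this repository, and the backup to config.bak is our own precaution for people who already have a kube config for other clusters.

```python
import shutil
from pathlib import Path


def install_kubeconfig(source, home=None):
    """Copy `source` to <home>/.kube/config, backing up any existing config."""
    home = Path(home) if home is not None else Path.home()
    target = home / ".kube" / "config"          # note: no file suffix
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.exists():
        # Keep the previous config so settings for other clusters are not lost.
        shutil.copy2(target, target.parent / "config.bak")
    shutil.copy2(source, target)
    return target
```

From the root of this repository, install_kubeconfig("kubeconfig.yaml") would place the file at ~/.kube/config.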

Install the run:ai CLI:

   # Sketch for macOS with Apple Silicon
   # Download the CLI from the link shown in the help section.
   wget --content-disposition https://rcp-caas-test.rcp.epfl.ch/cli/darwin
   # Give it the right permissions and move it.
   chmod +x ./runai
   sudo mv ./runai /usr/local/bin/runai
   sudo chown root: /usr/local/bin/runai

3: Login

Once you have access to the cluster and the tools installed, you can start setting up the cluster and running a test job.

   # Setup the RCP cluster
   runai config cluster rcp-context
   # Login to the cluster
   runai login
   # Check that things worked fine
   runai list projects
   # Set your default project
   runai config project mlo-lauzhack-$YOUR_GASPAR_USERNAME

Next, we provide a script in this repository to make your life easier to get started. Then we show you how to run the llm-baselines code.

4: Use this repo to start an LLM training run with your fork of the llm-baselines code

Clone this repository and create a user.yaml file in the root folder of the repo using the template in templates/user_template.yaml.

    git clone https://github.com/epfml/getting-started-lauzhack.git
    cd getting-started-lauzhack
    touch user.yaml # then copy the content from templates/user_template.yaml inside here and update

Fill in user.yaml with your username and userID, and update working_dir with your username. You can find this information in your profile on people.epfl.ch (e.g. https://people.epfl.ch/alexander.hagele) under “Administrative data”. For logging, also get an API key from Weights and Biases and add it to the yaml.
Create a pod with 1 GPU (you may need to install pyyaml with pip install pyyaml first).

    python csub.py -n sandbox

Wait until the pod has a 'running' status -- this can take a bit (max ~5 min or so). Check the status of the job with

    runai list # shows all jobs
    runai describe job sandbox # shows the status of the job sandbox

When it is running, connect to the pod with the command:

    runai exec sandbox -it -- zsh

If everything worked correctly, you should be inside a terminal on the cluster!

5: Cloning and running the LLM training code

Inside the pod, clone your fork of the llm-baselines repository into your personal folder on scratch.

    # Inside the pod
    cd /mloscratch/<your_username>
    git clone https://github.com/<your username>/llm-baselines.git
    cd llm-baselines

Now you can run the code as you would on your local machine. For example, to run the main.py script, you can use the following command:

    # Inside the pod, inside /mloscratch/<your username>/llm-baselines
    python src/main.py --wandb --wandb_project lauzhack-llm

Hopefully, this should work and you're up and running! You should be able to track your job on your wandb dashboard (and see the loss going down :) )

For remote development (changing code, debugging, etc.), we recommend using VSCode. You can find more information on how to set it up in the VSCode section.

Tip

Generally, the workflow we recommend is simple: develop your code locally or on the cluster (e.g. with VS Code), then push it to your repository. Once you want to try it out, run it on the cluster in the terminal that is attached via runai exec sandbox -it -- zsh. This way, you can keep your code and experiments organized and reproducible.

Note that your pods can be killed anytime. This means you might need to restart an experiment (with the python csub.py command we give above). You can see the status of your jobs with runai list. If a job has status "Failed", you have to delete it via runai delete job sandbox before being able to start the same job again.

Keep your files inside your personal folder: Importantly, when a job is restarted or killed, everything inside the container's own folders (such as ~/) is lost. This is why you need to work inside /mloscratch/<your username>.

To have a job that can run in the background, do python csub.py -n sandbox --train --command "cd /mloscratch/<your username>/llm-baselines; python src/main.py --wandb"
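
Quoting the inner command correctly is the part that is easiest to get wrong. Below is a minimal sketch that assembles the invocation above; build_train_command is a hypothetical helper (not part of csub.py), and the folder layout /mloscratch/<username>/llm-baselines is taken from the command above.

```python
import shlex


def build_train_command(username, job_name="sandbox", train_args="--wandb"):
    """Assemble the `python csub.py ... --train --command ...` invocation."""
    inner = (
        f"cd /mloscratch/{username}/llm-baselines; "
        f"python src/main.py {train_args}"
    )
    argv = ["python", "csub.py", "-n", job_name, "--train", "--command", inner]
    # shlex.join quotes the inner command so the shell passes it as one argument.
    return shlex.join(argv)


print(build_train_command("nicolas"))
```

The printed line can be pasted into a terminal opened in the root of this repository.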

The rest of the README is more detailed information on the cluster, the scripts, and the setup.

Using VSCODE

To easily attach a VSCODE window to a pod we recommend the following steps:

Install the Kubernetes and Dev Containers extensions.
From your VSCODE window, click on Kubernetes -> rcp-context -> Workloads -> Pods, and you should be able to see all your running pods.
Right-click on the pod you want to access and select Attach Visual Studio Code, this will start a vscode session attached to your pod.
The symlinks ensure that settings and extensions are stored in /mloscratch/<your username> and therefore shared across pods.
Note that the VS Code window opens in the home folder of the pod (not scratch!). You can get to your working directory (code) by navigating to /mloscratch/<your username>.

You can also see a pictorial description here.

Extra details on usage of the cluster

This is more info on the GPU cluster, the scripts, and the setup. You do not need to read this (unless you're interested in the details or want to go beyond the interactive workflow above). It is a copy of our internal documentation.

Using the python script to launch jobs

The python script csub.py is a wrapper around the run:ai CLI that makes it easier to launch jobs. It is meant to be used for both interactive jobs (e.g. notebooks) and training jobs. General usage:

    python csub.py -n <job_name> -g <number of GPUs> -t <time> --command <cmd> [--train]

Check the arguments for the script to see what they do.

What this repository does on first run:

We provide a default docker image mlo/mlo:v1 that should work for most use cases. If you use this default image, the first time you run csub.py, it will create a working directory with your username inside /mloscratch/homes. Moreover, for each symlink listed in the user.yaml file, the script will create the respective file/folder inside mloscratch and link it to the home folder of the pod. This ensures that these files and folders persist across different pods.
- Small caveat: csub.py expects your image to have zsh installed.

If you want to have another workflow, you can also directly use the run:ai CLI together with other docker images. See this section for more details.

Note to LauzHack participants: We do not recommend you to change the default image.

Managing pods

After starting pods with the script, you can manage your pods using run:ai and the following commands:

    runai exec pod_name -it -- zsh # opens an interactive shell on the pod
    runai delete job pod_name # kills the job and removes it from the list of jobs
    runai describe job pod_name # shows information on the status/execution of the job
    runai list jobs # lists all jobs and their status
    runai logs pod_name # shows the output/logs for the job
    runai config cluster ic-context # switches to the IC cluster context
    runai config cluster rcp-context # switches to the RCP cluster context

Some commands that might come in handy (credits to Thijs):

    # Clean up succeeded jobs from run:ai.
    runai list | grep " Succeeded " | awk '{print $1}' | parallel runai delete job {}
    # Overview of active jobs that fits on your screen.
    runai list jobs | sed '1d' | awk '{printf "%-42s %-20s\n", $1, $2}'
    # Auto-updating listing of jobs and their states.
    watch -n 10 "runai list | sed 1d | awk '{printf \"%0-40s %0-20s\n\", \$1, \$2}'"
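
If GNU parallel or awk are not at hand, the same filtering can be sketched in Python. This assumes, as the awk commands above do, that the first column of the listing is the job name and the second is a single-word status. The function name jobs_with_status and the sample listing are made up for illustration; real runai list output has more columns and may be laid out differently.

```python
def jobs_with_status(listing, status):
    """Return names of jobs whose second column equals `status`."""
    names = []
    for line in listing.splitlines()[1:]:      # skip the header row, like `sed 1d`
        fields = line.split()
        if len(fields) >= 2 and fields[1] == status:
            names.append(fields[0])
    return names


# Made-up listing for illustration only; not actual `runai list` output.
SAMPLE = """\
NAME      STATUS
sandbox   Running
run-a     Succeeded
run-b     Failed
"""

print(jobs_with_status(SAMPLE, "Succeeded"))   # names to pass to `runai delete job`
```

To use it on live data, the listing text could be obtained with subprocess.run(["runai", "list", "jobs"], capture_output=True, text=True).stdout. Filtering for "Failed" gives the jobs that must be deleted before they can be started again under the same name.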

Important notes and workflow

We provide the script in this repo as a convenient way of creating jobs.

The default job is just an interactive one (with sleep) that you can use for development.
- 'Interactive' jobs are a concept from run:ai. Every user can have 1 interactive GPU. They have higher priority than other jobs and can live up to 12 hours. You can use them for debugging. If you need more than 1 GPU, you need to submit a training job.
For a training job, use the flag --train, and replace the command with your training command. Using a training job allows you to use more than 1 GPU (up to 8 on one node). Moreover, a training job makes sure that the pod is killed when your code/experiment is finished in order to save money.

The script is just one suggested workflow that tries to maximize productivity and minimize costs -- you're free to find your own workflow, of course. Whichever workflow you go for, keep these things in mind:

Important

Work within /mloscratch. This is the shared storage that is mounted to your pod.
- Create a directory with your GASPAR username in the /mloscratch/ folder. This will be your personal folder. Except under special circumstances, all your files should be kept inside your personal folder (e.g. /mloscratch/nicolas if your username is nicolas) or in your personal home folder (e.g. /mloscratch/homes/nicolas).
- Should you use the csub.py script, the first run will automatically create a working directory with your username inside /mloscratch/homes.
- Suggestion: use a GitHub repo to store your code and clone it inside your folder.
Moving things onto the cluster or between folders can also be done easily via the HaaS machine. For more details on storage, see file management.
Remember that your job can get killed anytime if run:ai needs to make space for other users. Make sure to implement checkpointing and recovery into your scripts.
CPU-only pods are cheap, approx 3 CHF/month, so we recommend creating a CPU-only machine that you can let run and use for code development/debugging through VSCODE.
When your code is ready and you want to run some experiments or you need to debug on GPU, you can create one or more new pods with GPU. Simply specify the command in the python launch script.
Using a training job makes sure that you kill the pod when your code/experiment is finished in order to save money.
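
Since pods can be killed at any time, the checkpointing advice above deserves a concrete illustration. The following is a minimal, framework-free sketch of the save/resume pattern. The function names and the JSON state are assumptions made for the example; a real training script would store model and optimizer state instead (e.g. with torch.save) and would point the checkpoint path at a folder inside /mloscratch.

```python
import json
import os
from pathlib import Path


def save_checkpoint(path, state):
    """Write `state` atomically so a killed pod never leaves a half-written file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)                     # atomic rename on the same filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text())
    return {"step": 0}


def train(ckpt_path, total_steps, save_every=100):
    """Toy training loop that resumes from the last saved step."""
    state = load_checkpoint(ckpt_path)        # resume where the last pod stopped
    for step in range(state["step"], total_steps):
        # ... one real training step would go here ...
        state["step"] = step + 1
        if state["step"] % save_every == 0:
            save_checkpoint(ckpt_path, state)
    save_checkpoint(ckpt_path, state)
    return state
```

Writing to a temporary file and renaming it means that an interruption during saving leaves the previous checkpoint intact, so restarting the same command simply continues from the last saved step.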

File management

Reminder: the cluster uses kubernetes pods, which means that in principle, any file created inside a pod will be deleted when the pod is killed.

To store files permanently, you need to mount network disks to your pod. In our case, this is mloscratch -- all code and experimentation should be stored there. Except under special circumstances, all your files should be kept inside your personal folder (e.g. /mloscratch/nicolas if your username is nicolas) or in your personal home folder (e.g. /mloscratch/homes/nicolas). Scratch is high-performance storage that is meant to be accessed/mounted from pods. Even though it is called "scratch", you generally do not need to worry about losing data (it is just not replicated across multiple hard drives).
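
A cheap safeguard is to check output paths before a long run starts. The sketch below is a hypothetical helper (is_on_scratch is not part of this repository) that does a purely lexical check against the /mloscratch mount point named above. It does not follow symlinks, so the symlinked folders created by csub.py in the pod's home would need to be resolved first.

```python
from pathlib import PurePosixPath

SCRATCH_ROOT = PurePosixPath("/mloscratch")


def is_on_scratch(path):
    """True if an absolute pod path lies lexically inside the /mloscratch mount."""
    parts = PurePosixPath(path).parts
    if ".." in parts:
        # Refuse paths that could climb back out of the mount.
        return False
    return parts[: len(SCRATCH_ROOT.parts)] == SCRATCH_ROOT.parts


print(is_on_scratch("/mloscratch/homes/nicolas/checkpoints"))
print(is_on_scratch("/root/outputs"))
```

A training script could call this on its checkpoint directory at startup and abort early if the path would be lost with the pod.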

For very secure long-term storage, we have:

mlodata1
- This is long term storage, backed up carefully with replication (i.e. stored on multiple hard drives). This is meant to contain artifacts that you want to keep for an undetermined amount of time (e.g. things for a publication).
mloraw1
- Not clear right now how this will be used in the future (status: 15.12.2023).

Caution

You cannot mount mlodata1 or mloraw1 on pods. Use the HaaS machine described below to access them.

Moving data onto/between storage

Currently, if you need to move additional things to scratch, you need to do this manually via a machine provided by IT:

  # For basic file movement, folder creation, or
  # copying from/to mlodata1 to/from scratch:
  ssh <gaspar_username>@haas001.rcp.epfl.ch

The scratch volume is mounted inside the folder /mnt/mlo/scratch on that machine. You can copy files to and from it using cp or rsync.

Quick links

IC Cluster

Docs: https://icitdocs.epfl.ch/display/clusterdocs/IC+Cluster+Documentation
Dashboard: https://epfl.run.ai
Docker registry: https://ic-registry.epfl.ch/harbor/projects
Getting started guide: https://icitdocs.epfl.ch/display/clusterdocs/Getting+Started+with+RunAI

RCP Cluster

RCP main page: https://www.epfl.ch/research/facilities/rcp/
Docs: https://wiki.rcp.epfl.ch
Dashboard: https://rcpepfl.run.ai
Docker registry: https://registry.rcp.epfl.ch/
Getting started guide: https://wiki.rcp.epfl.ch/en/home/CaaS/Quick_Start

run:ai docs: https://docs.run.ai

If you want to read up more on the cluster, you can check out a great in-depth guide by our colleagues at CLAIRE. They have a similar setup of compute and storage:

Compute and Storage @ CLAIRE

More background on the cluster usage, run:ai and the scripts

We go through a few more details on the cluster usage, run:ai and the scripts in this section, and provide alternative ways to use the cluster.

Alternative workflow: using the run:ai CLI and base docker images with pre-installed packages

The setup in this repository is just one way of running and managing the cluster. You can also use the run:ai CLI directly, or use the scripts in this repository as a starting point for your own setup. For more details, see the dedicated readme.

Creating a custom docker image

In case you want to customize it and create your own docker image, follow these steps:

Request registry access: This step is needed to push your own docker images to the registry. Try logging in at https://ic-registry.epfl.ch/ and check whether you can see members inside the MLO project. The run:ai groups have already been added, so it should work out of the box. If not, reach out to Alex or a colleague.
Install Docker: brew install --cask docker (or any other preferred way according to the docker website). If you execute commands in the terminal and see the error “Cannot connect to the Docker daemon”, try starting Docker via the GUI once; the commands should work afterwards.
Login registry: docker login ic-registry.epfl.ch and use your GASPAR credentials. Same for the RCP cluster: docker login registry.rcp.epfl.ch (but we're currently not using it).

Modify Dockerfile:

The repo contains a template Dockerfile that you can modify in case you need a custom image.
Push the new docker image using the script publish.sh.
Remember to rename the image (mlo/username:tag) so that you do not overwrite the default one.

Additional example: Alternatively, Matteo also wrote a custom one and summarized the steps here: https://gist.github.com/mpagli/6d0667654bf8342eb4923fedf731660e

He created an image that runs by default under his Gaspar user ID and group ID. You can find those IDs in e.g. https://people.epfl.ch/matteo.pagliardini under 'donnees administratives'.
Upload your image to EPFL's registry

    docker build . -t <your-tag>
    docker login ic-registry.epfl.ch -u <your-epfl-username> -p <your-epfl-password> # use your epfl credentials
    docker tag <your-tag> ic-registry.epfl.ch/mlo/<your-tag>
    docker push ic-registry.epfl.ch/mlo/<your-tag>

Port forwarding

If you want to access a port on your pod from your local machine, you can use port forwarding. For example, if you want to access a jupyter notebook running on your pod, you can do the following:

    kubectl get pods
    kubectl port-forward <pod_name> 8888:8888
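
To confirm that the forward is up before opening a browser, a quick local check can help. This is a minimal sketch using only the Python standard library; port_is_open is a hypothetical helper, and 8888 is simply the port from the example above.

```python
import socket


def port_is_open(host="127.0.0.1", port=8888, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections and timeouts alike.
        return False
```

While kubectl port-forward is running in another terminal, port_is_open() should return True once the service inside the pod is listening.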

File overview of this repository

├── utils
│   ├── entrypoint.sh             # Sets up credentials and symlinks
│   ├── conda.sh                  # Conda installation
│   └── extra_packages.txt        # Extra python packages you want to install
├── templates
│   └── user_template.yaml        # Template for your user file. COPY IT, DO NOT CHANGE IT
├── docs
│   ├── faq.md                    # FAQ
│   └── runai_cli.md              # Run:ai CLI guide
├── csub.py                       # Creates a pod through run:ai; you can specify the number of GPUs, CPUs, docker image and time
├── Dockerfile                    # Dockerfile example
├── publish.sh                    # Script to push the docker image to the registry
├── kubeconfig.yaml               # Kubeconfig that you should store in ~/.kube/config
└── README.md                     # This file
