ML research is more efficient if we...
- Automate experiment logging and parameter search
- Offload computation to a compute cluster
This repository provides Python and shell-script examples demonstrating how to run experiments on an HPC cluster, using wandb to log and analyze the experiments.
The ML example is based on a Diffrax supervised neural ODE regression example. For a brief introduction to using W&B on a cluster, I found the following video helpful.
Weights & Biases (W&B) is a platform for ML research that allows you to:
- Track experiments (configurations, objective values)
- Visualize the training process online (e.g. objective functions and gradients)
- Track data sets and models
- Automate parameter search
- Share results with a team
Click here for a brief introduction to wandb. To use wandb you need to:
- Create a free account (e.g. with an academic license)
- Install the wandb library using pip
- Specify the experiment hyperparameters in your Python script using a simple wandb command, and store your loss using another wandb function (see the sketch after this list)
- Run the experiment. Open your wandb account in a browser and marvel at all the shiny plots of your experiments.
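A minimal sketch of these two steps (the project name and hyperparameter values are made up for illustration):

```python
import wandb

# register the run and its hyperparameters with wandb
run = wandb.init(
    project="node-regression",  # hypothetical project name
    config={"learning_rate": 1e-3, "batch_size": 32, "seed": 123},
)

for step in range(100):
    loss = 1.0 / (step + 1)     # placeholder for the training loss
    wandb.log({"loss": loss})   # store the loss on the wandb server

run.finish()
```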
To get started with wandb, I created a Google Colab. In this example, wandb is used to:
- Log a dataset to the wandb server after creating it (see the sketch after this list)
- Load the dataset from the wandb server before training
- Log the training configurations
- Log the optimization results
- Log the model parameters and gradient steps
- Automate parameter search with W&B sweeps
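For the dataset logging and loading steps, the W&B artifact API is the natural fit. A hedged sketch (artifact, project, and file names are assumptions; the Colab may do this differently):

```python
import wandb

# log a dataset to the wandb server after creating it
run = wandb.init(project="node-regression", job_type="create-dataset")
artifact = wandb.Artifact("toy-dataset", type="dataset")
artifact.add_file("data.npz")  # hypothetical dataset file
run.log_artifact(artifact)
run.finish()

# load the dataset from the wandb server before training
run = wandb.init(project="node-regression", job_type="train")
dataset_dir = run.use_artifact("toy-dataset:latest").download()
# ... train on the files in dataset_dir
```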
After playing around with the code, you can download the Python files and shell scripts in the folder `cluster` and move on to using the HPC cluster.
A cluster enables us to scale computation if needed.
To use the cluster (in my case the cluster of the RWTH Aachen), I found the following resources helpful:
- Introduction to Linux with nice YouTube videos. It helps to be comfortable working with console commands and Vim
- Getting started with HPC
- Intro to job scheduling
- RWTH HPC mainpage
- RWTH cluster - slurm commands
- RWTH cluster - account infos
- RWTH cluster - Batch script examples
The folder `cluster` contains the following files:
- `basic_example.py`: contains the same code as the notebook. Here we want to run the functions `main()` or `create_dataset()`.
- `sb.sh`: a simple shell script that activates a conda environment and executes a command, such as running a Python script.
- `sb_arr.sh`: same as `sb.sh`, but runs an array of jobs.
- `sweep_config.yaml`: contains the settings for the W&B sweep.
In what follows, we run the following jobs on the cluster:
- Run the script `basic_example.py` using python-fire
- Set up a W&B sweep for automatic parameter search on the cluster and manually assign the compute nodes on which the W&B agent executes `basic_example.py`
- Basically the above step, but using a "job array" to run many jobs of the W&B sweep in parallel
Useful console commands:
- Check the job status of a user: `squeue -u <user_id> --start`
- Check a specific job's status: `squeue --job <job_id>`
- Get a job efficiency report: `seff <job_id>`
- Cancel a job: `scancel <job_id>`
- Get info on the job history: `sacct`
- Get the manual on sbatch: `man sbatch`
- Get partition information: `sinfo -s`
- Check consumed core hours: `r_wlm_usage -p <cluster-project-name> -q`
To use an HPC cluster you need to:
- Create an HPC account, e.g. for the RWTH cluster via RegApp
- Optional: Set up cluster login using an SSH key
  - Navigate to `.ssh` in your home directory
  - Create a key with `ssh-keygen` and specify a password
  - Send the key to the cluster: `ssh-copy-id -i <public-key-name> <user-name>@login18-1.hpc.itc.rwth-aachen.de`
- Optional: Create a `config` file in `.ssh` with an SSH preset:
  ```
  Host <short-name>
      HostName <host-name>
      User <username>
      TCPKeepAlive yes
      ForwardX11 yes
  ```
- Write a shell script that instructs SLURM what you want the cluster's compute nodes to do (see the sketch after this list). Check the scheduling basics.
  - The shebang of the batch script must be `#!/usr/bin/zsh` on the RWTH cluster
- Clone the project from git: `git clone <git-project-URL>`
  - If authentication fails, see here
- Set up the environment using conda:
  1. Install miniconda on the login node:
     ```
     wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
     bash Miniconda3-latest-Linux-x86_64.sh
     conda config --set auto_activate_base false
     ```
  2. If the `conda` command is not found, execute `export PATH="/home/<username>/miniconda/bin:$PATH"` with your username.
  3. Install the conda environment from the yaml file: `conda env create --file conda_env.yaml`
  4. Activate the environment: `conda activate <environment name>`
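As a reference, a minimal sketch of what such a batch script could look like (the SBATCH options, folder names, and environment name are assumptions; the actual `sb.sh` in the repo may differ):

```sh
#!/usr/bin/zsh
#SBATCH --job-name=basic_example
#SBATCH --time=00:10:00
#SBATCH --mem-per-cpu=2G
#SBATCH --output=outputs/%x_%j.out
#SBATCH --error=errors/%x_%j.err

# make conda available and activate the environment
# (adjust the path to your miniconda installation; the env name is an assumption)
source ~/miniconda3/etc/profile.d/conda.sh
conda activate my_env

# run whatever command was passed to sbatch,
# e.g. "sbatch sb.sh python3 basic_example.py --seed 123"
"$@"
```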
In a console window (Linux/macOS), run the following commands to start a job:
- Connect to the cluster login node (after connecting to the RWTH VPN) via `ssh <username>@login18-1.hpc.itc.rwth-aachen.de`
- Start a screen session, which keeps your Linux session running in case your SSH connection drops:
  - Start a session: `screen -S <session-name>`
  - Reattach to a screen session, e.g. session 10386: `screen -r 10386`
  - List running screen sessions: `screen -ls`
  - List all windows: `<Ctrl+a> "`
  - Close the current region: `<Ctrl+a> X`
  - Detach from the screen session: `<Ctrl+a> d`
- Activate the conda environment with `conda activate <env>` and log into wandb with `wandb login`
- Execute a shell script with `sbatch sb.sh <python-call> <script-input-parameters>`, e.g.
  `sbatch sb.sh python3 basic_example.py --seed 123` or
  `sbatch sb.sh python3 -c 'import basic_example; basic_example.create_dataset(seed=123)'`
- Optional: Transfer files from the cluster to your local computer
  - Using secure copy for single files, e.g. `scp your_username@remotehost.edu:foobar.txt /some/local/directory`
  - Using rsync for multiple files (see the example below)
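For rsync, a typical call could look like the following (the remote path is a placeholder): `rsync -avz <username>@login18-1.hpc.itc.rwth-aachen.de:~/<project-folder>/outputs/ ./outputs/`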
Troubleshooting:
- Inside the batch script `sb.sh` you need to specify the right conda environment.
- The folders `errors` and `outputs` must be created inside your project folder using `mkdir <folder-name>`.
W&B sweeps
To let W&B do the parameter search for you...
- Open W&B in your browser, navigate to `Projects / <project name> / Sweeps`, and click on "Create Sweep"
- Copy `sweep_config.yaml` (in the repo folder `cluster`) and paste it into the browser (a sketch of such a config is shown below). Click on "Initialize sweep"
- In the console, execute `sbatch sb.sh wandb agent <name-of-sweep-as-in-wandb-browser>` as often as you want. W&B does the parameter selection and tells the compute nodes what to run. If you want to run many jobs at once, it is better to use SLURM job arrays as detailed below.
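For orientation, a minimal sweep config could look like the following (program name, search method, metric, and parameter ranges are assumptions; see the actual `sweep_config.yaml` in the repo):

```yaml
program: basic_example.py
method: bayes            # let W&B pick parameters via Bayesian optimization
metric:
  name: loss             # must match the key logged via wandb.log
  goal: minimize
parameters:
  learning_rate:
    min: 0.0001
    max: 0.1
  batch_size:
    values: [16, 32, 64]
```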
Running W&B sweeps using SLURM job arrays
A sweep might run for a long time, depending on how many parameters are to be checked. If a job is too long, it takes more time until the job gets a free slot on the cluster, and it potentially costs more cluster credits. We avoid these issues by using SLURM job arrays in the shell script `sb_arr.sh`. Here, we run a total of 30 jobs of 10 minutes each, with 5 jobs running at the same time, by using the command...
#SBATCH --array=1-30%5
Starting a wandb sweep with SLURM job arrays works the same as before; only the sbatch command changes slightly...
`sbatch sb_arr.sh wandb agent <name-of-sweep-as-in-wandb-browser>`
For a very long simulation, you can use the following shell command to tell SLURM that you want to run 30 jobs one after another...
#SBATCH --array=1-30%1
...and call the shell script as follows...
sbatch sb_arr_altered.sh python3 <simulation-file-name> -statefile=simstate.state
If the simulation does not finish in time, it will be followed by the next array task, which picks up right where the simulation left off (by reading in "simstate.state").
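A hypothetical sketch of how a simulation script might implement this pattern (the file format, names, and step logic are made up for illustration):

```python
import os
import pickle

def simulate(statefile="simstate.state", total_steps=100_000):
    # resume from the previous array task's state if a statefile exists
    if os.path.exists(statefile):
        with open(statefile, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"step": 0, "result": 0.0}

    while state["step"] < total_steps:
        state["result"] += 1.0          # placeholder for one simulation step
        state["step"] += 1
        if state["step"] % 1000 == 0:   # checkpoint regularly
            with open(statefile, "wb") as f:
                pickle.dump(state, f)

if __name__ == "__main__":
    simulate()
```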
Running SLURM job arrays using GPU and CUDA
The file `sb_gpu_arr.sh` illustrates how to use the RWTH cluster with CUDA. Make sure that the correct conda environment is used and that the ML library installed in this environment is compatible with the specified CUDA version.
Helpful console commands:
- List the CUDA versions available on the cluster: `module spider CUDA`
- Check the CUDA version: `nvcc --version`
- Print the CUDA path: `echo $CUDA_ROOT`
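A minimal sketch of what such a GPU job-array script could look like (the SBATCH options, module call, and environment name are assumptions; the actual `sb_gpu_arr.sh` in the repo may differ):

```sh
#!/usr/bin/zsh
#SBATCH --job-name=gpu_sweep
#SBATCH --array=1-30%5
#SBATCH --gres=gpu:1
#SBATCH --time=00:10:00
#SBATCH --output=outputs/%x_%A_%a.out
#SBATCH --error=errors/%x_%A_%a.err

# load a CUDA module compatible with the ML library in the conda environment
module load CUDA

# make conda available and activate the GPU environment (name is an assumption)
source ~/miniconda3/etc/profile.d/conda.sh
conda activate my_gpu_env

# run the command passed to sbatch, e.g. "wandb agent <sweep-name>"
"$@"
```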
- The python-fire library automatically generates command line interfaces (CLIs) from Python objects. For example, assume you have a Python file `main.py`...
```python
import fire

def main(batch_size=32, seed=5678):
    """some program"""

if __name__ == '__main__':
    fire.Fire(main)
```
... then you can run `main` from the console via `python main.py --seed=42`.
- You can open a second console and connect to the same remote computer. There, the console command `top` provides a dynamic real-time view of the system.
- The only way for SLURM to detect success or failure of your Python program is the exit code of the job script. You can store the exit code right after executing the program to prevent it from being overwritten...
```sh
#!/bin/bash
#SBATCH …
myScientificProgram …
EXITCODE=$?          # save the exit code immediately after the program runs
cp resultfile $HOME/jobresults/
/any/other/job/closure/cleanup/commands …
exit $EXITCODE       # hand the program's exit code back to SLURM
```