Survival guide for HPC system of HHU
Authors: Nikolas Adaloglou, Julius Ramarkes
Throughout this guide, use your HHU university login name and password. To reach the cluster you first need to connect with OpenVPN:
$ sudo openvpn HHU-VPN.ovpn
Start here
- Fill out the form to get access to the computer
- On a Linux terminal run
$ ssh <username>@hpc.rz.uni-duesseldorf.de
In my case:
$ ssh adaloglo@hpc.rz.uni-duesseldorf.de
- See all the installed software and libraries, such as CUDA versions and Python libraries (TensorFlow, PyTorch), with
$ module avail
- Modify the bashrc to load the modules you need
$ nano ~/.bashrc
Note that "#" comments a line out, which is handy if, for example, you need to switch between TF versions:
## comments start with #, like:
# Source global definitions
if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi
if env | grep -q ^JPY_API_TOKEN=
then
    export MODULES_FOR_JUPYTER=1
    #module load TensorFlow/1.14
    #module load TensorFlow/2.4.0
    module load PyTorch/1.8.0
    #module load Horovod
    module load APEX
    #module load BioPython
    # add your modules here!
fi
You can also load modules directly in the terminal, for example:
$ module load PyTorch/1.8.0
Then if you type
$ python
you should be able to import torch.
- Find your project folder
$ cd /gpfs/project/<username>
or
$ cd /gpfs/project/projects/<yourproject>
For example, in my case:
$ cd /gpfs/project/adaloglo/ && ls
- For saved outputs (checkpoints, statistics, figures, etc.) use:
/gpfs/scratch/<username>/...
Then you will need to transfer your code and data to this folder. Before that, take a look at the resource management:
General purpose stuff - resource management
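One way to allocate resources is through the JupyterHub of the HPC, but it is not recommended.
Like SLURM systems, the cluster works by launching jobs through a scheduler. The commands used below (qstat, qdel, pbsnodes, qsub) are PBS commands.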
- You can see the running jobs
qstat
- To see your jobs
qstat -u <yourloginname>
- You can cancel your job with
qdel <jobid>
Then you need to find your sub-server, i.e. the servers behind your queue (BioJob in our case).
For instance, our BioJob queue has three servers with 10 GPUs each, named hilbert302, hilbert309 and hilbert310.
- To see the available resources of a server, type:
$ pbsnodes <servername>
like:
$ pbsnodes hilbert310
Outputs:
Mom = hilbert310.hilbert.hpc.uni-duesseldorf.de
ntype = PBS
state = free
pcpus = 20
resources_available.accelerator_model = gtx1080ti
resources_available.arch = broadwell
resources_available.host = hilbert310
resources_available.hpmem = 0b
resources_available.ibswitch = 0xe41d2d0300450000,all
resources_available.mem = 256648mb
resources_available.ncpus = 20
resources_available.ngpus = 10
resources_available.Qlist = cuda
resources_available.vmem = 256648mb
resources_available.vnode = hilbert310
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.ngpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
license = l
last_state_change_time = Mon Aug 16 10:48:18 2021
last_used_time = Thu Aug 5 04:09:10 2021
As you can see, all 10 GPUs are free right now: resources_available.ngpus is 10 and resources_assigned.ngpus is 0.
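The comparison of available and assigned GPUs can also be automated. Below is a minimal Python sketch, assuming the "key = value" layout shown in the output above; the helper names `parse_pbsnodes` and `free_gpus` are made up for this illustration and are not part of PBS.

```python
def parse_pbsnodes(text):
    """Turn the 'key = value' lines of pbsnodes output into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:  # ignore lines that are not 'key = value'
            info[key.strip()] = value.strip()
    return info


def free_gpus(info):
    """Number of GPUs on the node that are not assigned to any job."""
    available = int(info.get("resources_available.ngpus", 0))
    assigned = int(info.get("resources_assigned.ngpus", 0))
    return available - assigned


# A few lines copied from the pbsnodes output above.
sample = """\
Mom = hilbert310.hilbert.hpc.uni-duesseldorf.de
state = free
resources_available.ngpus = 10
resources_assigned.ngpus = 0
"""

node = parse_pbsnodes(sample)
print(node["state"], free_gpus(node))  # free 10
```

On the cluster, the text could come from running pbsnodes through `subprocess.run` instead of the hard-coded sample.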
Data transfer using scp
For extensive file transfers it is recommended to use storage.hpc.rz.uni-duesseldorf.de, where you can copy files with scp.
You can run the commands from your local terminal!
Send files using scp
$ scp file.zip <username>@storage.hpc.rz.uni-duesseldorf.de:/gpfs/project/<username>/<project_folder>
Receive files using scp
$ scp <username>@storage.hpc.rz.uni-duesseldorf.de:/gpfs/project/<username>/<project_folder>/<file.zip> ./
An alternative is FileZilla on Linux.
- German guide
Launch jobs
Given that everything else is set up, you now need to write a run_my_code.sh file and submit it with qsub:
$ qsub run_my_code.sh
Below is an example of run_my_code.sh:
#!/bin/bash
#PBS -l select=1:ncpus=4:mem=10gb:ngpus=2:accelerator_model=gtx1080ti # or a100
#PBS -l walltime=167:59:00
#PBS -A "DeepGenome" # modify accordingly
#PBS -r n
#PBS -q "BioJob" # specify your server if you know where to launch your script
#PBS -m e
#PBS -M <username>@uni-duesseldorf.de
set -e
module load CUDA/10.0.130
module load Python/3.6.5
module load TensorFlow/1.14
python /gpfs/project/<username>/<project_folder>/main.py # modify
You can create your file with
$ touch run_my_code.sh
and then edit it with
$ nano run_my_code.sh
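If you launch many jobs with different resources, the job file can also be generated instead of edited by hand. The following is a rough Python sketch, not an official tool: `make_pbs_script` and its parameters are hypothetical names, and the directives simply mirror the example script above.

```python
def make_pbs_script(command, account, queue, ncpus=4, mem="10gb", ngpus=1,
                    accelerator_model="gtx1080ti", walltime="24:00:00",
                    modules=()):
    """Build the text of a PBS job script like the example above."""
    select = (f"select=1:ncpus={ncpus}:mem={mem}:ngpus={ngpus}"
              f":accelerator_model={accelerator_model}")
    lines = [
        "#!/bin/bash",
        f"#PBS -l {select}",
        f"#PBS -l walltime={walltime}",
        f'#PBS -A "{account}"',
        "#PBS -r n",
        f'#PBS -q "{queue}"',
        "set -e",
    ]
    lines += [f"module load {name}" for name in modules]
    lines.append(command)
    return "\n".join(lines) + "\n"


script = make_pbs_script(
    "python main.py",  # placeholder command, replace with your own
    account="DeepGenome",
    queue="BioJob",
    ngpus=2,
    modules=["PyTorch/1.8.0"],
)
print(script)
```

Writing the returned text to run_my_code.sh and submitting it with qsub would then work as described above.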
Launch interactive sessions on HPC to test your code
In the ssh HPC login shell you can also ask for resources interactively, e.g. 2 CPUs, 15 GB of memory and 2 GPUs for 12 hours would be:
qsub -A DeepGenome -I -l select=1:ncpus=2:mem=15G:ngpus=2:accelerator_model=gtx1080ti -l walltime=12:00:00
Install your Python packages locally
- Here is a list of trusted python packages that can be installed. If the package that you need is not available you need to contact the HPC team at
hpc-support [at] uni-duesseldorf.de
You can locally install the packages like this:
pip install --upgrade --target=<your-project/user-folder>/site-packages -i http://pypi.repo.test.hhu.de/simple/ --trusted-host pypi.repo.test.hhu.de <your-awesome-package>
Here is an example that installs einops (note that pip refuses to combine --user with --target, so only --target is given):
module load PyTorch/1.8.0
pip install --upgrade --target=/gpfs/project/adaloglo/site-packages -i http://pypi.repo.test.hhu.de/simple/ --trusted-host pypi.repo.test.hhu.de einops
If you don't want to specify the --target:
pip install --upgrade --user -i http://pypi.repo.test.hhu.de/simple/ --trusted-host pypi.repo.test.hhu.de einops
Install requirements.txt locally
The best practice here is to be very specific about your pip package versions. Here is an example of my requirements.txt file:
torch==1.8.1
torchvision==0.9.1
pandas>=1.1.5
omegaconf>=2.0.6
lightning-bolts==0.3.3
pytorch-lightning==1.3.2
einops>=0.3.0
scikit-learn>=0.24.2
Then load the modules and install the requirements:
module load Python/3.6.5
module load CUDA/10.2.89
pip install --upgrade --user -i http://pypi.repo.test.hhu.de/simple/ --trusted-host pypi.repo.test.hhu.de -r requirements.txt
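Since the advice is to be specific about versions, it can help to list the requirements that are not pinned to an exact version. This is a small Python sketch under simple assumptions (one requirement per line, "==" means pinned); `unpinned_requirements` is a hypothetical helper, not a pip feature.

```python
def unpinned_requirements(text):
    """Return the requirement lines that are not pinned with '=='."""
    loose = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if "==" not in line:
            loose.append(line)
    return loose


# Three lines taken from the requirements.txt example above.
sample = """\
torch==1.8.1
pandas>=1.1.5
einops>=0.3.0
"""
print(unpinned_requirements(sample))  # ['pandas>=1.1.5', 'einops>=0.3.0']
```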
Then you can open a Python terminal and test your imports. If the packages are not recognized (not observed so far), consider adding the --target path to PYTHONPATH or, if you use the default --user location:
pip install --upgrade --user -i http://pypi.repo.test.hhu.de/simple/ --trusted-host pypi.repo.test.hhu.de scikit-learn
export PYTHONPATH=/home/<username>/.local/lib/python3.6/site-packages:$PYTHONPATH
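As an alternative to exporting PYTHONPATH, the package folder can be added from inside Python with the standard `site` module. The sketch below is only an illustration: `add_package_dir` is a hypothetical helper, and a temporary folder with a dummy module stands in for your real site-packages folder. Note that `site.addsitedir` appends to the end of `sys.path`, whereas PYTHONPATH entries come first, so the two are not fully equivalent.

```python
import importlib
import os
import site
import sys
import tempfile


def add_package_dir(path):
    """Make packages installed with `pip install --target=path` importable."""
    path = os.path.abspath(path)
    if path not in sys.path:
        site.addsitedir(path)  # appends to sys.path and processes .pth files
    return path


# Demo: a throwaway folder with a dummy module stands in for site-packages.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "demo_pkg_hhu.py"), "w") as handle:
    handle.write("ANSWER = 42\n")

add_package_dir(demo_dir)
importlib.invalidate_caches()
demo = importlib.import_module("demo_pkg_hhu")
print(demo.ANSWER)  # 42
```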
Debug/Test your code before launching the job
Now that your data and code are on the server, it is time to see what's happening. Start an interactive session with minimal requirements:
qsub -A DeepGenome -I -l select=1:ncpus=2:mem=10G:ngpus=1:accelerator_model=gtx1080ti -l walltime=2:00:00
Set your number of workers and your available GPUs accordingly. You can type nvidia-smi to see the GPUs you requested. A good practice is to do this before importing anything else (apart from os):
import os
os.environ['CUDA_VISIBLE_DEVICES'] = "5"
The example uses the GPU with GPU_ID 5, based on nvidia-smi.
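To make the order of operations explicit, here is a minimal Python sketch that sets CUDA_VISIBLE_DEVICES first and imports the deep learning library only afterwards. `select_gpus` is a hypothetical helper name, and the torch part is optional so that the snippet also runs where PyTorch is not installed.

```python
import os


def select_gpus(gpu_ids):
    """Restrict CUDA to the given GPU ids; call before importing torch."""
    value = ",".join(str(gpu_id) for gpu_id in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


select_gpus([5])  # GPU_ID 5, as in the example above

try:
    import torch  # imported only after the environment variable is set

    print("CUDA available:", torch.cuda.is_available())
    print("Visible GPUs:", torch.cuda.device_count())
except ImportError:
    print("torch is not installed in this environment")
```

With a single id selected, the framework should see at most one device, which it addresses as device 0.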
Now you can execute the lines of the .sh script:
module load Python/3.6.5
module load PyTorch/1.8.0
cd /gpfs/project/<username>/<project_folder>
pip install --upgrade --user -i http://pypi.repo.test.hhu.de/simple/ --trusted-host pypi.repo.test.hhu.de -r requirements.txt
#python -m hello_world # to test imports
python -m main_hpc_code_script
Unzip your data
After sending your data over, check the md5sum hash to see if the data were transmitted correctly:
md5sum file.zip
The output should match your local file.
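The same check can be scripted, for instance on a local machine that has no md5sum command. This is a minimal Python sketch using the standard `hashlib` module; `md5_of_file` and `matches` are hypothetical helper names. The file is read in chunks so that large archives do not have to fit into memory.

```python
import hashlib
import sys


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare a file against a hash string as printed by md5sum."""
    return md5_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Usage: python check_md5.py file.zip [more files...]
    for file_path in sys.argv[1:]:
        print(md5_of_file(file_path), file_path)
```

Run it on both sides of the transfer and compare the printed values, or pass the hash from the other side to `matches`.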
It is recommended to unzip your files in an interactive session:
qsub -A DeepGenome -I -l select=1:ncpus=1:mem=15G -l walltime=3:00:00
Unzipping options
unzip file.zip
If that doesn't work you can try:
module load Java/11.0.2
jar xf file.zip
Another alternative is to re-pack the data with tar and extract it like this:
tar --create --verbose --gzip --file=CheX.tar.gz file*/
tar xzvf CheX.tar.gz
NOT for HPC: Develop interactively with Jupyter server
ssh <username>@servername
Remote - server side
- Create a virtual environment:
virtualenv --python=python3.6 ~/test_env
or
virtualenv-3 --python=python3.6 ~/test_env
- Load the virtualenv:
$ source ~/test_env/bin/activate
- Install Jupyter:
$ pip install jupyter ipykernel
- Install the project requirements, including Jupyter:
pip install -r requirements.txt
- Start the Jupyter server, either in the foreground:
jupyter notebook --no-browser --port=8889
or in the background, writing its output to a text file:
jupyter notebook --no-browser --port=8889 >~/jupyter_lab_output_8889_$(date +"%Y%m%d").txt 2>&1 &
- Look into the newly generated text file for the web URL to open in your local browser.
- To access Jupyter, you first need to set up the client side.
- Once you have done so, you can probably copy and paste the shown URLs into your local browser.
Local/client side:
- Port forwarding, e.g. from local port 8888 to port 8889 on the server:
$ ssh -N -f -L localhost:8888:localhost:8889 <username>@<server_id>
or, using port 8916 on both sides:
$ ssh -L 8916:localhost:8916 <username>@<server_id>
- Open the URL in your local browser:
localhost:8888
or
localhost:8916
(token authentication required)
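When the local and remote ports differ, the URL printed by the server has to be adapted before it is pasted into the local browser. The following Python sketch shows one way to do that. It assumes the log contains a line of the form `http://localhost:<port>/?token=<token>`, which may differ between Jupyter versions; `local_url` is a hypothetical helper name.

```python
import re

# Assumed shape of the URL line in the Jupyter log.
URL_PATTERN = re.compile(r"https?://(?:localhost|127\.0\.0\.1):(\d+)/\S*token=\w+")


def local_url(log_text, local_port):
    """Find the token URL in the log and swap in the forwarded local port."""
    match = URL_PATTERN.search(log_text)
    if match is None:
        return None
    remote_port = match.group(1)
    return match.group(0).replace(f":{remote_port}/", f":{local_port}/", 1)


# Made-up log excerpt with a made-up token, for illustration only.
sample_log = """\
[I NotebookApp] The Jupyter Notebook is running at:
[I NotebookApp] http://localhost:8889/?token=0123abcd
"""
print(local_url(sample_log, 8888))  # http://localhost:8888/?token=0123abcd
```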