raffaello-camoriano / cineca_cheatsheets

Repository of publicly available guides


CINECA GUIDE

  1. Getting started
  2. Submit a job
  3. Additional info
  4. Conda and git
  5. Singularity
  6. SLURM

Getting started

For registration and account association, follow:

https://wiki.u-gov.it/confluence/display/SCAIUS/UG2.1+Getting+started#expand-3Connectingtothecluster
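Once the account is active, you connect to a login node via SSH. A minimal sketch, assuming a Galileo100 account (check the user guide or your welcome email for the exact login host):

	ssh <username>@login.g100.cineca.it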

Commands and scripts used inside the cluster (CINECA) to submit a job:

./train.sh <num_cpu> <max_walltime> (e.g. ./train.sh 12 24:00:00)

train.sh:

	#!/bin/bash
	# >>> Pulling repos
	...
	sbatch --job-name=job_example --cpus-per-task=${1} --time=${2} --output=./slurm_output/job_example.out --error=./slurm_output/job_example.out train.sbatch
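The repo-pulling step above is elided; a possible sketch of what it could contain (the repository path and branch are hypothetical), run on the login node where internet access is available:

	# update the code before submitting, since compute nodes are offline
	cd ~/my_project
	git pull origin main
	cd -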

train.sbatch:

	#!/bin/bash
	#SBATCH --partition=g100_usr_prod
	#SBATCH --mem=20000M
	#SBATCH --ntasks=1
	#SBATCH --mail-type=ALL
	#SBATCH --mail-user=<your email>

	# >>> IF YOU NEED TO USE A CONTAINER, FOLLOW THE CODE BELOW (otherwise put your code here) <<<
	# Load the module
	module load singularity
	# Run the container
	singularity exec --hostname ${SLURM_SUBMIT_HOST}${SLURM_JOB_ID} ./container.sif bash ./container_train.sh
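If the job needs a GPU, the sbatch header changes slightly; a minimal sketch (the partition name is an assumption, check sinfo and the user guide for the actual GPU partition):

	#!/bin/bash
	#SBATCH --partition=<gpu_partition>
	#SBATCH --gres=gpu:1
	#SBATCH --mem=20000M
	#SBATCH --ntasks=1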

container_train.sh:

	#!/bin/bash
	# >>> Activate the conda environment
	...
	# >>> Execute code
	...
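A possible expansion of container_train.sh (the conda path, environment name, and entry point are hypothetical and depend on how the container was built):

	#!/bin/bash
	# >>> Activate the conda environment (path inside the container may differ)
	source /opt/conda/etc/profile.d/conda.sh
	conda activate <env_name>
	# >>> Execute code
	python train.py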

CINECA: Additional info

https://wiki.u-gov.it/confluence/display/SCAIUS/UG3.3%3A+GALILEO100+UserGuide

CINECA supports tmux as a terminal multiplexer: https://tmuxcheatsheet.com/
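tmux keeps sessions alive across SSH disconnects, which is handy on login nodes. The basic workflow:

	tmux new -s <session_name>     # start a named session
	# run your commands, then detach with Ctrl-b followed by d
	tmux attach -t <session_name>  # reattach later
	tmux ls                        # list open sessions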

Compute nodes on CINECA have no internet access (jobs run offline inside the running node). Therefore:

  • pull the repos before submitting the job (e.g. in train.sh)
  • to use a logger (e.g. wandb):
    • set wandb to offline mode with WANDB_MODE=offline (see the sketch after this list)
    • to sync with the server, run inside the wandb folder: wandb sync --include-offline ./offline-*
    • script to synchronize wandb offline runs (assuming a group directory containing more than one run):
            #!/bin/bash

            # argument 1: group directory containing the offline runs
            # conda activate needs conda's shell hook when run inside a script
            source ~/miniconda3/etc/profile.d/conda.sh
            conda activate <env_name>

            # generate a fresh wandb run id to merge all offline runs into
            RAND_ID=$(python3 -c "import wandb; print(wandb.util.generate_id())")

            echo "Syncing runs $1 to new run id $RAND_ID"

            # first, sync the most recent series of logs to the new id
            first_dir=$(ls -t "$1" | head -1)
            wandb sync "$1/$first_dir/wandb/$(ls -t "$1/$first_dir/wandb/" | grep offline | head -1)" --id "$RAND_ID"

            # then all the others + the last one again, to sync the hyper-parameters
            for dir in $(ls "$1")
            do
                run=$(ls "$1/$dir/wandb/" | grep offline)
                echo "$run"
                wandb sync "$1/$dir/wandb/$run" --id "$RAND_ID"
            done
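For the offline mode itself, a minimal sketch (train.py is a hypothetical entry point; WANDB_MODE is wandb's standard environment variable):

	# log locally on the compute node
	export WANDB_MODE=offline
	python train.py
	# later, from a node with internet access:
	wandb sync --include-offline ./wandb/offline-*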
      

Installing conda and git on the cluster

  • Install conda:
        mkdir -p ~/miniconda3
        wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
        bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
        rm -rf ~/miniconda3/miniconda.sh
        ~/miniconda3/bin/conda init bash
        ~/miniconda3/bin/conda init zsh
    
  • Create and activate a conda environment:
        conda create -n <env_name> python=3.8
        conda activate <env_name>
    
  • If singularity is not installed:
        conda install -c conda-forge singularity
    
  • Clone git repositories:
        conda install gh --channel conda-forge
        gh auth login
        <gh token>
        git clone <repo>
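  • Export an environment to recreate it elsewhere, e.g. on another cluster (standard conda commands; the file name is just a convention):
        conda env export -n <env_name> > environment.yml
        conda env create -f environment.yml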
    

Singularity: Additional info

Clusters (e.g. CINECA and other HPC systems) usually do not allow the use of Docker for security reasons; however, Singularity can be used as an alternative. Unlike Docker, Singularity recreates the container's filesystem as a directory tree on top of the host filesystem. Therefore, if you originally created the Docker container with the path /home/a/b/c, Singularity virtually creates the path /home/a/b/c inside the actual host filesystem. When you use Singularity for the first time, take note of these steps:

  • add to your ~/.bashrc file:
        export SINGULARITY_CACHEDIR=/scratch/gpfs/$USER/SINGULARITY_CACHE
        export SINGULARITY_TMPDIR=/tmp
  • pull the Docker image <docker_path> and convert it into a Singularity image <container>.sif (on your login node):
        module load singularity
        singularity pull docker://<docker_path>
  • NOTE: if you cannot pull the image from the cluster, you can copy a pre-existing .sif file into the cluster instead
  • test the Singularity container using:
        singularity shell <container>.sif
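  • if the job needs a GPU, pass --nv so Singularity exposes the NVIDIA driver and libraries inside the container (the script name here is hypothetical):
        module load singularity
        singularity exec --nv ./container.sif python train.py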
    

SLURM cheatsheet

  • submit a job: sbatch <file_name>.sbatch
  • show all jobs: squeue
  • show your jobs: squeue -u <username>
  • show job info: scontrol show job <job_id>
  • partition status: sinfo
  • delete a job: scancel <job_id>
  • run an interactive session on a node: srun --nodes=1 --ntasks-per-node=1 --pty /bin/bash
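To inspect the resources a finished job actually used, SLURM's accounting tool helps (the field list below is just an example):

	sacct -j <job_id> --format=JobID,JobName,State,Elapsed,MaxRSS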
