rstrudel / segmenter

[ICCV2021] Official PyTorch implementation of Segmenter: Transformer for Semantic Segmentation

Multi-GPU Training Not On SLURM

luck528 opened this issue

Hello, thanks a lot for contributing such excellent work. I noticed that the distributed multi-GPU training is based on the SLURM platform, which is not easy to run on other platforms. Could you or anyone else provide some tips on changing the code from SLURM-based to non-SLURM-based, so that multi-GPU distributed training can also be run on other platforms?

Hi @luck528,
Thanks for your interest in our work. Indeed, we use SLURM environment variables for distributed training. However, they appear in only two specific places and can be replaced by another scheduler's variables; the details are scheduler specific.

Here are the two places where you can find SLURM variables:
https://github.com/rstrudel/segmenter/blob/master/segm/utils/torch.py#L25-L27
https://github.com/rstrudel/segmenter/blob/master/segm/utils/distributed.py#L13-L23
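Roughly speaking, those lines read the SLURM variables from the environment and use them to set up torch.distributed; a simplified paraphrase (not a verbatim copy of the repository code) looks like this:

import os
import torch
import torch.distributed as dist

# Simplified paraphrase of the linked lines (not verbatim repository code):
# the SLURM variables are read from the environment and used to pick the
# right GPU on each node and to initialize the process group.
local_rank = int(os.environ["SLURM_LOCALID"])
rank = int(os.environ["SLURM_PROCID"])
world_size = int(os.environ["SLURM_NTASKS"])

torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl", init_method="env://", rank=rank, world_size=world_size)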

What do these variables mean, and which variables should you look for with another scheduler?
Let's assume you launch training on 2 nodes with 4 GPUs each, and that we are on the first GPU of the second node.
SLURM_LOCALID: GPU local rank (=0, as it is the first GPU of the node)
SLURM_PROCID: GPU global rank (=4, as it is the fifth GPU among the 8)
SLURM_NTASKS: world size (=8, the number of individual tasks launched, i.e. the total number of GPUs)
MASTER_ADDR: 127.0.0.1 (keep default, used to communicate between nodes)
MASTER_PORT: 12345 (keep default, used to communicate between nodes)
SLURM_STEP_GPUS: can be ignored

So mainly, you need to find the equivalents of SLURM_LOCALID, SLURM_PROCID, and SLURM_NTASKS for your scheduler; a sketch of one possible replacement is shown below.
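For instance, PyTorch's torchrun launcher exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT, so one scheduler-agnostic option is to fall back to those. The helper below is only a sketch to illustrate the idea (init_distributed is not a function from this repository):

import os
import torch
import torch.distributed as dist

def init_distributed():
    # Prefer SLURM variables when present, otherwise fall back to the ones
    # exported by torchrun (or set manually). Defaults are for a single process.
    rank = int(os.environ.get("SLURM_PROCID", os.environ.get("RANK", 0)))
    local_rank = int(os.environ.get("SLURM_LOCALID", os.environ.get("LOCAL_RANK", 0)))
    world_size = int(os.environ.get("SLURM_NTASKS", os.environ.get("WORLD_SIZE", 1)))

    # Rendezvous address/port; keep the defaults on a single node.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "12345")

    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    return rank, local_rank, world_size
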
The script I use to launch a job with SLURM is as follows; values in {} are replaced by values set for the experiment:

#!/bin/bash
#SBATCH --job-name={job_name}

#SBATCH --nodes={nodes}
#SBATCH --ntasks-per-node={gpus_per_node}
#SBATCH --gres=gpu:{gpus_per_node}
#SBATCH --cpus-per-task={cpus_per_gpu}
#SBATCH --hint=nomultithread

#SBATCH --time={time}

# cleaning modules launched during interactive mode
module purge

conda activate {conda_env_name}

mkdir {checkpoint_dir}/{job_name}
srun --output {checkpoint_dir}/{job_name}/%j.out --error {checkpoint_dir}/{job_name}/%j.err \
python -m segm.train \
  --log-dir {checkpoint_dir}/{job_name}/{backbone}_{decoder} \
  --dataset {dataset} \
  --epochs {epochs} \
  --backbone {backbone} \
  --decoder {decoder}

Again, it should be adaptable to your scheduler. I hope this helps.
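
As a side note, if your platform has no scheduler at all, then once the two files above are adapted to read torchrun-style variables, a single-machine launch could look roughly like this (a sketch, adjust the arguments to your setup):

torchrun --nproc_per_node=4 -m segm.train \
  --log-dir {checkpoint_dir}/{job_name}/{backbone}_{decoder} \
  --dataset {dataset} \
  --epochs {epochs} \
  --backbone {backbone} \
  --decoder {decoder}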

Dear author,

Thanks a lot for your detailed and helpful feedback. I am now using the LSF scheduler. I can find the LSF environment variable corresponding to SLURM_NTASKS, but I don't think LSF has environment variables corresponding to SLURM_LOCALID and SLURM_PROCID.

Thanks a lot again!

Hi @luck528,
I don't know LSF, and thus the LSF environment variables; probably the best you can do is contact your cluster admin and ask for the mapping from SLURM to LSF variables. I am sure there is a way, but I cannot help you more on this side.
Best,
Robin

Hi, does the script you provided not work on a single node? When I use the script on a single node, dist.init_process_group times out. Could you provide a detailed script for SLURM on one node with 4 GPUs?

#!/bin/bash
#SBATCH --job-name=test

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=10

#SBATCH --time=48:00:00

# cleaning modules launched during interactive mode
module purge

conda activate {conda_env_name}

srun \
python -m segm.train --log-dir seg_tiny_mask --dataset ade20k  --batch-size 4 --backbone vit_tiny_patch16_384 --decoder mask_transformer 

I banged my head against the wall for about a day to get this to work.

Not sure if zhihou7 is around here anymore, but just in case someone else finds this useful, my script is below.

This is for 8 gpus on a single machine (node) with 64 vCPUs.

One other change I made was to manually overwrite the MASTER_ADDR environment variable in segm/utils/distributed.py:

os.environ["MASTER_ADDR"] = <insert your host name here>

#!/bin/bash

export SLURM_LOCALID=0
export SLURM_PROCID=0
export SLURM_NTASKS=8
export DATASET=<PATH TO DATASET>
#SBATCH --job-name=seg_base_mask
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=8
#SBATCH --hint=nomultithread
#SBATCH --time=72:00:00
# cleaning modules launched during interactive mode
module purge
# conda activate segmenter
if [ ! -d "./checkpoints/seg_base_mask" ]; then
    echo "Creating Directory ./checkpoints/seg_base_mask"
    mkdir ./checkpoints/seg_base_mask
fi
srun --output checkpoints/seg_base_mask/%j.out --error checkpoints/seg_base_mask/%j.err \
python -m segm.train \
--log-dir checkpoints/seg_base_mask/vit_base_patch16_384_mask_transformer \
--epochs 80 \
--batch-size 16 \
--dataset ade20k \
--backbone vit_base_patch16_384 \
--decoder mask_transformer

Hi @mashreve,
Thanks for your comment. I have solved my problem.
Regards,