[Training with submitit] How to use it?

Question

DoranLyong opened this issue 2 years ago · comments

Thanks for sharing your nice work. It is a good baseline for my study :)
I'm trying to train on ImageNet-1k using your code, and I confirmed that training on a single machine works well with the given instructions:

python -m torch.distributed.launch --nproc_per_node=1 main.py \
--model hornet_tiny_7x7 --drop_path 0.2 --clip_grad 5 \
--batch_size 128 --lr 4e-3 --update_freq 4 \
--model_ema true --model_ema_eval true \
--data_path /path/to/imagenet-1k \
--output_dir ./logs/hornet_tiny_7x7

I then tried to run the training loop with run_with_submitit.py, following the instructions in TRAINING.md. However, nothing happened, and I have no idea how to use it.

Is it only for training with multiple GPUs?
Or is there something I am missing?

Yongming Rao · Answer 1 · Thu Sep 22 2022 14:49:22 GMT+0800 (China Standard Time)

Hi, thanks for your interest in our work. run_with_submitit.py is only for multi-node training on a Slurm cluster, such as 4x 8-GPU servers. If you want to train our model on a single machine with multiple GPUs, you can directly use the python -m torch.distributed.launch command. The results of the two launch methods should be similar if the same number of GPUs is used.
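To make the "same number of GPUs" point concrete, here is a minimal sketch, not the repository's own code. It assumes the effective batch size is the per-GPU --batch_size times --update_freq times the total number of GPUs, and that the Slurm request goes through submitit's AutoExecutor. The helper names slurm_params, effective_batch_size and submit, and the ./submitit_logs folder, are made up for illustration.

```python
def slurm_params(nodes: int, ngpus: int, timeout_min: int = 60) -> dict:
    """Keyword arguments for submitit's executor.update_parameters()."""
    return {
        "nodes": nodes,
        "gpus_per_node": ngpus,
        "tasks_per_node": ngpus,  # one process per GPU
        "timeout_min": timeout_min,
    }


def effective_batch_size(batch_size: int, update_freq: int, nodes: int, ngpus: int) -> int:
    # Assumption: per-GPU batch x gradient-accumulation steps x total GPUs.
    return batch_size * update_freq * nodes * ngpus


def submit(fn, nodes: int, ngpus: int, log_dir: str = "./submitit_logs"):
    """Submit fn() as a Slurm job; needs submitit installed and a working cluster."""
    import submitit  # imported lazily so the helpers above work without it

    executor = submitit.AutoExecutor(folder=log_dir)
    executor.update_parameters(**slurm_params(nodes, ngpus))
    return executor.submit(fn)


if __name__ == "__main__":
    print(slurm_params(nodes=4, ngpus=2))
    # Single-GPU command from the question: 128 * 4 * 1 GPU
    print(effective_batch_size(128, 4, nodes=1, ngpus=1))
    # 4 nodes x 2 GPUs with the same per-GPU settings: 128 * 4 * 8 GPUs
    print(effective_batch_size(128, 4, nodes=4, ngpus=2))
```

Under that assumption, going from 1 GPU to 4x 2 GPUs without touching --batch_size or --update_freq multiplies the effective batch size by 8, so --update_freq (or --lr) may need adjusting when comparing runs.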

DoranLyong · Answer 2 · Thu Sep 22 2022 15:24:42 GMT+0800 (China Standard Time)

@raoyongming Thanks for the fast response.
Then, can I customize it for multi-node training on something like 4x 2-GPU servers?

Actually, I tried simply changing the --nodes and --ngpus arguments, but I get an error...

--nodes 4 --ngpus 2

Because of the error, I changed --nodes from 4 to 1, and now the job seems to stay pending.

I'm really new to using a Slurm cluster...
How can I fix it?

Yongming Rao · Answer 3 · Thu Sep 22 2022 16:25:44 GMT+0800 (China Standard Time)

It seems the server cannot find the other nodes in the cluster. You may want to check whether the cluster is configured correctly. We ran our experiments on large shared clusters that were set up for many users, so I am not sure how to configure one myself. Besides, if you are using 3090 Ti GPUs instead of V100/A100, the communication cost between nodes can be quite large. In that case, it might be better to train on a single node.
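One way to see why a job stays pending is to ask Slurm for the reason it reports. The sketch below is only an illustration: it assumes squeue is on the PATH, the helper names parse_squeue and my_jobs are made up, and the SAMPLE text is invented data rather than output from any real cluster.

```python
from __future__ import annotations

import subprocess

# squeue format: job id | state | node count | pending reason (or node list)
SQUEUE_FORMAT = "%i|%T|%D|%R"

# Invented sample in the format above, used so the parser can run without Slurm.
SAMPLE = "12345|PENDING|1|(Resources)\n12346|PENDING|4|(PartitionNodeLimit)\n"


def parse_squeue(text: str) -> list[dict]:
    """Turn 'id|state|nodes|reason' lines into a list of dictionaries."""
    jobs = []
    for line in text.strip().splitlines():
        job_id, state, nodes, reason = line.split("|", 3)
        jobs.append(
            {"job_id": job_id, "state": state, "nodes": int(nodes), "reason": reason}
        )
    return jobs


def my_jobs(user: str) -> list[dict]:
    """Query squeue for one user's jobs; only works on a Slurm login node."""
    result = subprocess.run(
        ["squeue", "-h", "-u", user, "-o", SQUEUE_FORMAT],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_squeue(result.stdout)


if __name__ == "__main__":
    for job in parse_squeue(SAMPLE):
        print(job["job_id"], job["state"], job["nodes"], job["reason"])
```

A reason such as (Resources) or (Priority) usually means the job is waiting for free nodes, while something like (PartitionNodeLimit) suggests the request asks for more nodes than the partition allows. Running sinfo shows which partitions and nodes the cluster actually exposes.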

DoranLyong · Answer 4 · Thu Sep 22 2022 16:43:41 GMT+0800 (China Standard Time)

Okay, thanks :)
That was very helpful.