raoyongming / HorNet

[NeurIPS 2022] HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions

Home Page: https://hornet.ivg-research.xyz/

[Training with submitit] How to use it?

DoranLyong opened this issue

Thanks for sharing your nice work. It is a good baseline for my study :)
I'm trying to train on ImageNet-1k with your code, and I confirmed that training on a single machine works well following the given instructions:

python -m torch.distributed.launch --nproc_per_node=1 main.py \
--model hornet_tiny_7x7 --drop_path 0.2 --clip_grad 5 \
--batch_size 128 --lr 4e-3 --update_freq 4 \
--model_ema true --model_ema_eval true \
--data_path /path/to/imagenet-1k \
--output_dir ./logs/hornet_tiny_7x7

I tried to run the training loop with run_with_submitit.py following the instructions in TRAINING.md, but nothing happened and I have no idea how to use it.

Is it only for training with multiple GPUs?
Or is there something I missed?

Hi, thanks for your interest in our work. run_with_submitit.py is only for multi-node training on a Slurm cluster, e.g. 4x 8-GPU servers. If you want to train our model on a single machine with multiple GPUs, you can directly use the python -m torch.distributed.launch command. The results of the two launch methods should be similar if the same number of GPUs is used.
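
To make the "same number of GPUs" point concrete: in training scripts of this style, the number of samples per optimizer step is typically the per-GPU batch size times the gradient accumulation steps times the total number of GPU processes. The sketch below assumes that --batch_size is per GPU and that --update_freq is the accumulation count; the function name is hypothetical and not part of the repository.

```python
def effective_batch_size(batch_size: int, update_freq: int,
                         nodes: int, gpus_per_node: int) -> int:
    """Samples contributing to one optimizer step.

    Assumption: batch_size is the per-GPU batch size and update_freq is the
    number of gradient accumulation steps.
    """
    world_size = nodes * gpus_per_node  # total number of GPU processes
    return batch_size * update_freq * world_size


# The single-GPU command from the question (--nproc_per_node=1).
single_gpu = effective_batch_size(128, 4, nodes=1, gpus_per_node=1)

# A 4x 2-GPU layout with the same per-GPU settings.
four_by_two = effective_batch_size(128, 4, nodes=4, gpus_per_node=2)

print(single_gpu, four_by_two)
```

If the GPU count changes, adjusting --update_freq so that this product stays the same is one way to keep runs comparable.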

@raoyongming Thanks for the fast response.
Then, can I customize it for multi-node training, like 4x 2-GPU servers?

Actually, I tried simply changing the --nodes and --ngpus arguments, but there is an error...

--nodes 4 --ngpus 2

Because of the error, I changed --nodes from 4 to 1, and now the job seems to be stuck in a pending state.

I'm really new to using a Slurm cluster...
How can I fix it?

It seems the server cannot find the other nodes in the cluster. You may want to check whether the cluster is configured correctly. We ran our experiments on large shared clusters that are set up for many users, so I am not sure how to configure a cluster myself. Besides, if you are using 3090 Ti GPUs instead of V100/A100, the communication cost between nodes can be quite large. In that case, it might be better to train on a single node.
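
A job that stays pending often means Slurm cannot currently allocate the requested nodes. One way to check is to list the nodes with sinfo -N -h -o "%N %T %G" and count the idle ones that advertise enough GPUs. The sketch below only parses that kind of output; the helper name, node names, and sample text are made up for illustration, and the GRES string format varies between clusters.

```python
import re


def idle_gpu_nodes(sinfo_text: str, gpus_needed: int) -> list[str]:
    """Return names of idle nodes that advertise at least gpus_needed GPUs.

    Expects lines of the form "<node> <state> <gres>", as printed by
    sinfo -N -h -o "%N %T %G" (assumed format; check it on your cluster).
    """
    nodes: list[str] = []
    for line in sinfo_text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue
        name, state, gres = parts[0], parts[1], parts[2]
        # GRES may look like "gpu:2" or "gpu:rtx3090:2"; take the count.
        match = re.search(r"gpu(?::[^:,()\s]+)?:(\d+)", gres)
        gpus = int(match.group(1)) if match else 0
        # A node can be listed once per partition, so skip duplicates.
        if state == "idle" and gpus >= gpus_needed and name not in nodes:
            nodes.append(name)
    return nodes


# Hypothetical sinfo output, for illustration only.
sample = """\
node01 idle gpu:2
node02 allocated gpu:2
node03 idle gpu:rtx3090:2
node04 down* gpu:2
node05 idle (null)
"""

available = idle_gpu_nodes(sample, gpus_needed=2)
print(available)
```

In this made-up example only two nodes qualify, so a request for --nodes 4 --ngpus 2 could not start right away. On a real cluster, squeue -u $USER also shows the reason a job is pending in its NODELIST(REASON) column.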

Okay, thanks :)
That was very helpful.