ajabri / videowalk

Repository for "Space-Time Correspondence as a Contrastive Random Walk" (NeurIPS 2020)

Home Page: http://ajabri.github.io/videowalk

How many GPUs did you use for training?

Steve-Tod opened this issue

Hi, thank you for making the code public.

I used the training and testing commands you provided. However, the final test result of the model from the last epoch is slightly lower than the number you reported: J&F-Mean 67.6 (yours) vs. 66.9 (ours).

I'm guessing the issue might be that you didn't use sync_bn, so the batch norm statistics are computed per GPU, and maybe I'm using a different number of GPUs than you did.

So how many GPUs did you use during training?
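For what it's worth, here is a minimal sketch of what enabling synchronized batch norm could look like in PyTorch. Note that `torch.nn.SyncBatchNorm` only takes effect under `DistributedDataParallel` (one process per GPU), not under the `DataParallel` wrapper in the released code, so this assumes a DDP-style launch; the model and launch details below are placeholders, not the repo's code.

```python
import os

import torch
import torchvision

# Sketch only: assumes the script is launched with torchrun, one process per GPU.
torch.distributed.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Hypothetical encoder standing in for the repo's ResNet-based model.
model = torchvision.models.resnet18().cuda()

# Replace every BatchNorm layer with SyncBatchNorm so running statistics are
# computed over the full cross-GPU batch instead of each GPU's chunk.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

# SyncBatchNorm only has an effect under DistributedDataParallel.
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```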

Hi, @Steve-Tod. How long did it take to train the model with your GPUs? I have never trained on the Kinetics-400 dataset, so I wonder whether four 2080 Ti GPUs are enough to train the model. Thanks.

Hi all,

Thanks for your interest! I actually trained my models on 2 2080s. Indeed, I did not use sync_bn, and since the effective number of videos per batch ends up being small, there may be variance in the batch statistics. Changing the momentum of the batchnorm running stats could improve transfer stability at convergence, as could parameter averaging at the end of training.

Best,
Allan
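A minimal sketch of the two tweaks mentioned above, assuming a generic PyTorch model; the momentum value and the SWA-style averaging are illustrative choices, not settings taken from the repo.

```python
import torch
import torchvision
from torch.optim.swa_utils import AveragedModel, update_bn

# Placeholder encoder; the real model would be the repo's ResNet-based CRW model.
model = torchvision.models.resnet18()

# 1) Lower the BatchNorm momentum so the running statistics update more slowly,
#    which can make them less noisy when the number of videos per batch is small.
for m in model.modules():
    if isinstance(m, (torch.nn.BatchNorm1d, torch.nn.BatchNorm2d, torch.nn.BatchNorm3d)):
        m.momentum = 0.01  # PyTorch's default is 0.1; 0.01 is an illustrative value

# 2) Parameter averaging at the end of training, e.g. with the built-in SWA utilities.
avg_model = AveragedModel(model)
# Inside the training loop, after optimizer steps late in training:
#     avg_model.update_parameters(model)
# After training, recompute the BN running stats for the averaged weights:
#     update_bn(train_loader, avg_model)
```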

@ajabri Do I understand correctly that 25 epochs were used? How much wall-clock time does it take? I'm downloading the dataset and preparing to reproduce the run now, and I'm curious how long it will take.

@ajabri Have you used DataParallel or DistributedDataParallel in practice? The released code uses DataParallel, but it has a variable model_without_ddp, which suggests that DDP (DistributedDataParallel) might have been used.

Have you used a batch size of 20 for training the model? 20 is the default batch size in the released code; the paper does not specify one.
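For context, `model_without_ddp` is a naming convention from the torchvision reference training scripts, where a handle to the unwrapped module is kept regardless of the parallel wrapper. Below is a hedged sketch of that pattern; whether the repo actually follows it is exactly what the question asks.

```python
import torch
import torchvision

model = torchvision.models.resnet18().cuda()  # placeholder model

# torchvision-style pattern: keep a handle to the unwrapped module (for
# checkpointing, parameter groups, etc.) no matter which wrapper is used.
use_ddp = False  # the released training script appears to default to DataParallel

if use_ddp:
    # Requires torch.distributed.init_process_group() and one process per GPU.
    model = torch.nn.parallel.DistributedDataParallel(model)
    model_without_ddp = model.module
else:
    model = torch.nn.DataParallel(model)
    model_without_ddp = model.module

# Either way, checkpoints save the plain module's weights:
#     torch.save(model_without_ddp.state_dict(), "checkpoint.pth")
```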

I've tried to modify the code into a DDP version, because DataParallel is inefficient and GPU utilization was low. However, after I changed it to DDP, the performance at each epoch drops by about 1-2 points...

For the pretrained model, I indeed trained for 25 epochs. That said, from what I observed, the model reaches >95% of its final performance in about 1/4 of that training time. With a batch size of 20, I believe training takes about one week.

I initially used DistributedDataParallel, but switched to DataParallel early on as it was easier to debug. One difference with DDP is the way batchnorm stats might be handled.
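Regarding the 1-2 point drop reported above when porting to DDP: one possible factor (among others) is that with DataParallel the batch size is a single global batch split across GPUs, while with DDP each process loads its own batch, so the effective number of clips per step and the per-GPU batch-norm statistics change unless the loader is adjusted. A hedged sketch of one way to keep the global batch comparable follows; the dataset and loader arguments are placeholders, not the repo's.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Placeholder dataset standing in for Kinetics-400 video clips.
dataset = TensorDataset(torch.randn(100, 3, 8, 64, 64))

world_size = torch.distributed.get_world_size() if torch.distributed.is_initialized() else 1

total_batch_size = 20                                # default in the released code
per_gpu_batch_size = total_batch_size // world_size  # keep the global batch comparable

# With DataParallel, batch_size is one global batch split across GPUs; with DDP,
# each process loads its own batch, hence the division by world_size above.
sampler = DistributedSampler(dataset, shuffle=True) if world_size > 1 else None
loader = DataLoader(dataset, batch_size=per_gpu_batch_size, sampler=sampler,
                    shuffle=(sampler is None), num_workers=4, pin_memory=True)

# Note: without SyncBatchNorm, each DDP process still normalizes with statistics
# computed from only its own per_gpu_batch_size clips.
```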

@ajabri If possible, could you please share the training curves / wandb logs? This would help a lot with the initial "debugging" of model extensions...