ajabri / videowalk

Repository for "Space-Time Correspondence as a Contrastive Random Walk" (NeurIPS 2020)

Home Page: http://ajabri.github.io/videowalk

How many GPUs did you use for training?

Steve-Tod opened this issue

Hi, thank you for making the code public.

I used the training and testing commands you provided. However, the final test result of the model from the last epoch is slightly lower than the number you reported: J&F-Mean 67.6 (yours) vs. 66.9 (ours).

I'm guessing the issue might be that you didn't use sync_bn, so the batch norm statistics are computed per GPU, and maybe I'm using a different number of GPUs than you did.

So how many GPUs did you use during training?
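For what it's worth, here is a minimal sketch of what enabling synchronized batch norm could look like in PyTorch. Note that `torch.nn.SyncBatchNorm` only takes effect under `DistributedDataParallel` (one process per GPU), not under the `DataParallel` wrapper in the released code, so this assumes a DDP-style launch; the model and launch details below are placeholders, not the repo's code.

```python
import os

import torch
import torchvision

# Sketch only: assumes the script is launched with torchrun, one process per GPU.
torch.distributed.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Hypothetical encoder standing in for the repo's ResNet-based model.
model = torchvision.models.resnet18().cuda()

# Replace every BatchNorm layer with SyncBatchNorm so running statistics are
# computed over the full cross-GPU batch instead of each GPU's chunk.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

# SyncBatchNorm only has an effect under DistributedDataParallel.
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```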

Hi, @Steve-Tod. How long did it take to train the model with your GPUs? I have never trained on the Kinetics-400 dataset, so I wonder whether four 2080 Ti GPUs are enough to train the model. Thanks.

Hi all,

Thanks for your interest! I actually trained my models on 2 2080s. Indeed, I did not use sync_bn, and since the effective number of videos per batch ends up being small, there may be variance in the batch statistics. Changing the momentum of the batchnorm running stats could improve transfer stability at convergence, as could parameter averaging at the end of training.

Best,
Allan
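A minimal sketch of the two tweaks mentioned above, assuming a generic PyTorch model; the momentum value and the SWA-style averaging are illustrative choices, not settings taken from the repo.

```python
import torch
import torchvision
from torch.optim.swa_utils import AveragedModel, update_bn

# Placeholder encoder; the real model would be the repo's ResNet-based CRW model.
model = torchvision.models.resnet18()

# 1) Lower the BatchNorm momentum so the running statistics update more slowly,
#    which can make them less noisy when the number of videos per batch is small.
for m in model.modules():
    if isinstance(m, (torch.nn.BatchNorm1d, torch.nn.BatchNorm2d, torch.nn.BatchNorm3d)):
        m.momentum = 0.01  # PyTorch's default is 0.1; 0.01 is an illustrative value

# 2) Parameter averaging at the end of training, e.g. with the built-in SWA utilities.
avg_model = AveragedModel(model)
# Inside the training loop, after optimizer steps late in training:
#     avg_model.update_parameters(model)
# After training, recompute the BN running stats for the averaged weights:
#     update_bn(train_loader, avg_model)
```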

@ajabri Do I understand correctly that 25 epochs were used? How much wall-clock time does it take? I'm downloading the dataset and preparing to reproduce the run now, and I'm curious how long it will take.

@ajabri Have you used DataParallel or DistributedDataParallel in practice? The released code uses DataParallel, but it has a variable model_without_ddp, which suggests that DDP (DistributedDataParallel) might have been used.

Have you used a batch size of 20 for training the model? 20 is the default batch size in the released code; the paper does not specify one.
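For context, `model_without_ddp` is a naming convention from the torchvision reference training scripts, where a handle to the unwrapped module is kept regardless of the parallel wrapper. Below is a hedged sketch of that pattern; whether the repo actually follows it is exactly what the question asks.

```python
import torch
import torchvision

model = torchvision.models.resnet18().cuda()  # placeholder model

# torchvision-style pattern: keep a handle to the unwrapped module (for
# checkpointing, parameter groups, etc.) no matter which wrapper is used.
use_ddp = False  # the released training script appears to default to DataParallel

if use_ddp:
    # Requires torch.distributed.init_process_group() and one process per GPU.
    model = torch.nn.parallel.DistributedDataParallel(model)
    model_without_ddp = model.module
else:
    model = torch.nn.DataParallel(model)
    model_without_ddp = model.module

# Either way, checkpoints save the plain module's weights:
#     torch.save(model_without_ddp.state_dict(), "checkpoint.pth")
```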

I've tried to modify the code into a DDP version, because DataParallel is inefficient and GPU utilization was low. However, after I changed it to DDP, the performance at each epoch drops by about 1-2 points...

For the pretrained model, I indeed trained for 25 epochs. That said, from what I observed, the model reaches >95% of its final performance in about 1/4 of that training time. With a batch size of 20, I believe training takes about one week.

I initially used DistributedDataParallel, but switched to DataParallel early on as it was easier to debug. One difference with DDP is the way batchnorm stats might be handled.
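Regarding the 1-2 point drop reported above when porting to DDP: one possible factor (among others) is that with DataParallel the batch size is a single global batch split across GPUs, while with DDP each process loads its own batch, so the effective number of clips per step and the per-GPU batch-norm statistics change unless the loader is adjusted. A hedged sketch of one way to keep the global batch comparable follows; the dataset and loader arguments are placeholders, not the repo's.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Placeholder dataset standing in for Kinetics-400 video clips.
dataset = TensorDataset(torch.randn(100, 3, 8, 64, 64))

world_size = torch.distributed.get_world_size() if torch.distributed.is_initialized() else 1

total_batch_size = 20                                # default in the released code
per_gpu_batch_size = total_batch_size // world_size  # keep the global batch comparable

# With DataParallel, batch_size is one global batch split across GPUs; with DDP,
# each process loads its own batch, hence the division by world_size above.
sampler = DistributedSampler(dataset, shuffle=True) if world_size > 1 else None
loader = DataLoader(dataset, batch_size=per_gpu_batch_size, sampler=sampler,
                    shuffle=(sampler is None), num_workers=4, pin_memory=True)

# Note: without SyncBatchNorm, each DDP process still normalizes with statistics
# computed from only its own per_gpu_batch_size clips.
```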

@ajabri If possible, could you please share the training curves / wandb logs? This would help a lot with the initial "debugging" of model extensions...