rishikksh20 / ViViT-pytorch

Implementation of ViViT: A Video Vision Transformer


Evaluation results

ed-fish opened this issue · comments

Hi,

Thanks for your work on a PyTorch version of the paper - much appreciated!

How does this implementation compare to the results in the original paper? Specifically on the Moments in Time dataset.

Thanks,

Ed

commented

I am also interested in this topic.
If anyone could give me more information about the model parameters that might help me fix the problem, I would be grateful, because training with the default parameters always overfits.

Thanks.
Marco

I ran the model on a very small dataset (51 classes with 20 video clips per class) and the result is very strange: it always outputs the same prediction. I wonder if it would get better if I loaded pre-trained weights. I would appreciate any tips.

Thanks,
Dylan
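A quick way to narrow down the "always the same prediction" symptom is to count how many distinct classes the model actually emits over a loader. Here is a small sketch (the model/loader here are generic placeholders, not names from this repo): if it returns 1, the model has collapsed to a single output, which usually points to a too-high learning rate, an unshuffled loader, or labels that don't line up with the inputs, rather than the architecture itself.

```python
import torch

def prediction_diversity(model, loader, device="cpu"):
    """Count how many distinct classes the model predicts over a loader.

    A return value of 1 means the model has collapsed to a constant
    output; with random weights or healthy training you should see
    several distinct classes.
    """
    model.eval()
    seen = set()
    with torch.no_grad():
        for videos, _ in loader:
            logits = model(videos.to(device))
            seen.update(logits.argmax(dim=-1).tolist())
    return len(seen)
```

Running this before and after a few epochs tells you whether training is collapsing the outputs or whether they were degenerate from the start.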


I am facing the same situation as you. I still don't have any idea about it. Waiting for a reply from the authors.


I wonder whether the problem results from the code or from my too-small dataset.

I tried it with nearly 2000 videos and ran different numbers of epochs, but the accuracy never exceeds 21.09%. The strange thing is that it's the same for most of the runs; the figures don't change.

Your dataset is too small. You can try running ViViT with ViT's weights loaded for both the temporal and spatial parts.
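The usual way to do this kind of initialisation is to copy over every pretrained tensor whose name and shape match, and leave the video-specific parameters (tubelet embedding, temporal positions) randomly initialised. This is a generic sketch of that idea, not the exact key names of this repo or of any particular ViT checkpoint:

```python
import torch

def load_matching_weights(model, pretrained_state):
    """Copy every pretrained tensor whose key and shape match the model.

    Keys that exist only in the video model (e.g. temporal blocks with
    different names, tubelet embeddings) are simply skipped and stay
    randomly initialised. Returns the list of keys that were loaded.
    """
    own_state = model.state_dict()
    filtered = {
        k: v for k, v in pretrained_state.items()
        if k in own_state and v.shape == own_state[k].shape
    }
    # strict=False tolerates the keys we deliberately left out.
    model.load_state_dict(filtered, strict=False)
    return sorted(filtered)
```

If the ViT checkpoint's key names don't line up with the ViViT module names, you would first remap the prefixes (e.g. pointing the same ViT block weights at both the spatial and the temporal transformer, as the paper describes) before calling this.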

@DylanTao94 Can you share how I can do that?

Sorry mate, I'm not allowed to share my code. You can follow the steps in the ViViT paper.

Yes, this model works fine. I've tested it on a dataset of 50k videos.

@seandatasci I think I might be doing something wrong with the code. Can you help me out? My code is here

I have the same problem, and I wonder whether you have resolved it. My acc/AUC results are lower than 50%, and my dataset size is also 2000. Thank you.

Inspired by the author's implementation of ViViT, we have reimplemented TimeSformer and ViViT and released pretrained model weights on Kinetics-600, which can be found here

The model isn't learning. I trained it on 2 classes of the UCF101 dataset, with the Adam optimizer and CrossEntropyLoss.
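Before blaming the dataset, one sanity check worth running with exactly this setup (Adam + CrossEntropyLoss) is to see whether the model can overfit a single small batch. A correctly wired model should drive the loss on one fixed batch close to zero; if it can't, the bug is in the code (shapes, label alignment, learning rate), not in the dataset size. A sketch, where the 3e-4 default learning rate is my assumption rather than a value from this repo:

```python
import torch
import torch.nn as nn

def overfit_one_batch(model, videos, labels, lr=3e-4, steps=200):
    """Try to drive the loss on one fixed batch toward zero.

    Uses the same Adam + CrossEntropyLoss setup as the comment above.
    Returns the final loss; a value that stays near log(num_classes)
    means the model is not learning at all on even this trivial task.
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    loss = None
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(videos), labels)
        loss.backward()
        opt.step()
    return loss.item()
```

With 2 classes, an untrained model sits near log(2) ≈ 0.69 loss and 50% accuracy, so a flat ~50% curve on this check points to gradients not flowing (wrong reshape, detached tensors) or an unsuitable learning rate.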