microsoft / esvit

EsViT: Efficient self-supervised Vision Transformers

Mode Collapse on Custom Dataset

hayatrajani opened this issue

Hi! First of all kudos on the great work!

So, I am experimenting on a custom dataset of about 70k images consisting of 7 different classes. However, the model seems to collapse after 3-4 epochs of training. I have tried playing around with different embedding dimensions for the out_dim parameter and lower values for teacher_temp to increase sharpening, but in vain.

Have you experimented with smaller datasets? Would you be able to provide any suggestions in this case?

Thanks!

Thanks for trying out the codebase. I have not run a complete pre-training and evaluation on a smaller dataset, though I usually use a dataset of similar size (e.g., ImageWoof) for debugging on a local machine.

Could you please post your hyperparameters, dataset settings, and training logs here (e.g., what exactly do you observe when "the model seems to collapse after 3-4 epochs of training")? That way we can look at the details and start the discussion.

Thank you for getting back to me on this.

Here are my training logs. In the logs you can also find the entropy (H) and KL divergence for each epoch, both with and without centering and/or sharpening, to look for collapse as suggested by the authors of DINO in their paper. In addition, I tried logging the cosine similarity of the output embeddings to see whether the model collapses to the same representation regardless of the input.
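For readers who want to reproduce this kind of check, here is a minimal sketch of the three diagnostics mentioned above: entropy of the teacher distribution, KL divergence between teacher and student, and mean pairwise cosine similarity of the embeddings. This is not the logging code used for the attached logs; all function names are hypothetical, and it uses plain Python lists rather than the tensors the training loop would actually hold.

```python
import math

def softmax(logits, temp=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp((x - m) / temp) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p, eps=1e-12):
    """Shannon entropy H(p) in nats; eps guards against log(0)."""
    return -sum(pi * math.log(pi + eps) for pi in p)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats, with p as the teacher and q as the student."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all distinct pairs (needs >= 2 vectors)."""
    n = len(embeddings)
    sims = [
        cosine_similarity(embeddings[i], embeddings[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return sum(sims) / len(sims)

# Illustrative values only, not taken from the training logs.
teacher = softmax([2.0, 0.5, 0.1, -1.0], temp=0.03)
student = softmax([1.5, 0.7, 0.0, -0.5], temp=0.1)
print("H(teacher):", entropy(teacher))
print("KL(teacher || student):", kl_divergence(teacher, student))
print("mean cosine:", mean_pairwise_cosine([[1.0, 0.2], [0.9, 0.3], [0.1, 1.0]]))
```

As a rough guide to reading the numbers: with K output dimensions, entropy near log(K) suggests collapse to a uniform output, entropy near 0 suggests one dimension dominating, and a mean pairwise cosine similarity close to 1 across different inputs suggests the embeddings have become nearly identical.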

I use all the default settings except out_dim and teacher_temp, which, for this log, were set to 4096 and 0.03 respectively.
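To make the role of teacher_temp concrete, below is a small sketch of the centering and sharpening step as described in the DINO paper: the teacher logits are shifted by a running center and divided by the temperature before the softmax. It is an illustration under assumptions, not the EsViT implementation; the function names, the momentum value, and the example logits are all hypothetical.

```python
import math

def teacher_probs(logits, center, temp):
    """Center the logits, sharpen with the temperature, then apply softmax."""
    shifted = [(x - c) / temp for x, c in zip(logits, center)]
    m = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p, eps=1e-12):
    """Shannon entropy in nats; eps guards against log(0)."""
    return -sum(pi * math.log(pi + eps) for pi in p)

def update_center(center, batch_logits, momentum=0.9):
    """Exponential moving average of the per-dimension batch mean.

    The momentum value here is an assumption chosen for illustration.
    """
    batch_mean = [sum(col) / len(col) for col in zip(*batch_logits)]
    return [momentum * c + (1.0 - momentum) * b for c, b in zip(center, batch_mean)]

# Illustrative logits; a real head would have out_dim entries (4096 in this run).
logits = [2.0, 1.0, 0.5, 0.1]
center = [0.0, 0.0, 0.0, 0.0]
for temp in (1.0, 0.07, 0.03):
    probs = teacher_probs(logits, center, temp)
    print(f"temp={temp}: entropy={entropy(probs):.4f}")
```

Lowering the temperature always reduces the entropy of the teacher distribution for non-constant logits, which is why a small teacher_temp pushes against collapse to the uniform distribution, while centering pushes against a single dimension dominating.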

I will also try training the model on another dataset of similar size to see if I can debug the issue.

Thanks for the support!