Zasder3 / train-CLIP

A PyTorch Lightning solution to training OpenAI's CLIP from scratch.

How many images and captions are required to train my own CLIP?

gunwooYong opened this issue

Hello, I am Yong, a computer vision researcher.

I was impressed by your code and wondered how to fine-tune CLIP.
I want to classify images with CLIP.
I only have at most 10 images per class, and there are 4-6 classes in total.

Is fine-tuning possible in this situation?

Thank you.

Hi Yong,

Glad you like the code! For reference, the original model was trained on roughly 400 million image-text pairs. You are most likely to find success with zero-shot classification, or with a linear classifier on top of an existing pretrained model (see the sketches below).
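Here's a minimal zero-shot sketch using the openai/clip package (`pip install git+https://github.com/openai/CLIP.git`); the class names and image path are placeholders for your own data:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder labels: substitute your own 4-6 class names.
class_names = ["class A", "class B", "class C", "class D"]
text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    # logits_per_image: similarity of the image to each text prompt.
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

print("Predicted class:", class_names[probs.argmax(dim=-1).item()])
```

Prompt wording ("a photo of a ...") matters for zero-shot accuracy, so it's worth trying a few templates for your domain.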

The fine-tuning code is useful for training arbitrary pairs of pretrained models. For example, it's quite efficient to pair an RN50 pretrained on ImageNet-1k with a pretrained BERT.
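Purely as an illustration of that idea (this is not the repo's actual API), a CLIP-style wrapper around a torchvision RN50 and a Hugging Face BERT might look like this:

```python
# Illustrative sketch only -- not train-CLIP's API. Pairs a pretrained image
# encoder with a pretrained text encoder behind projection heads and a
# CLIP-style symmetric contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50
from transformers import AutoModel

class MiniCLIP(nn.Module):
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        # Image tower: RN50 pretrained on ImageNet-1k, classifier head removed.
        self.visual = resnet50(weights="IMAGENET1K_V2")
        self.visual.fc = nn.Identity()
        self.image_proj = nn.Linear(2048, embed_dim)
        # Text tower: pretrained BERT, pooled via the [CLS] token.
        self.text = AutoModel.from_pretrained("bert-base-uncased")
        self.text_proj = nn.Linear(self.text.config.hidden_size, embed_dim)
        # Learnable temperature, initialized to ln(1/0.07) as in the CLIP paper.
        self.logit_scale = nn.Parameter(torch.tensor(2.6592))

    def forward(self, images, input_ids, attention_mask):
        img = F.normalize(self.image_proj(self.visual(images)), dim=-1)
        out = self.text(input_ids=input_ids, attention_mask=attention_mask)
        txt = F.normalize(self.text_proj(out.last_hidden_state[:, 0]), dim=-1)
        return self.logit_scale.exp() * img @ txt.t()

def clip_loss(logits):
    # Symmetric cross-entropy over image->text and text->image directions.
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2
```

Starting from two pretrained towers means only the projection heads (and optionally the towers) need tuning, which is why this setup converges far faster than training from scratch.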

You may find success fine-tuning an existing CLIP, but it's likely the model will overfit.
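The lower-risk route with ~10 images per class is the linear probe mentioned above: freeze CLIP, featurize your images, and fit a simple classifier, similar to the linear-probe evaluation in the CLIP paper. A sketch assuming scikit-learn, where `train_images` (a list of PIL images) and `train_labels` are placeholders for your data:

```python
import torch
import clip
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def featurize(images):
    # Encode a list of PIL images with the frozen CLIP image tower.
    with torch.no_grad():
        batch = torch.stack([preprocess(im) for im in images]).to(device)
        feats = model.encode_image(batch)
    return feats.float().cpu().numpy()

X_train = featurize(train_images)   # shape: (n_samples, 512) for ViT-B/32
clf = LogisticRegression(C=1.0, max_iter=1000)
clf.fit(X_train, train_labels)

# At test time: clf.predict(featurize(test_images))
```

With so few samples, it's worth tuning the regularization strength `C` via cross-validation rather than trusting a single split.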

All the best,
Cade