openai / CLIP

CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image

On a Bag-of-Words baseline and a transformer

iburenko opened this issue · comments

Dear authors,

Thank you very much for the great work!

I am trying to understand how one could obtain the bag-of-words representations for a caption that are described in Sec. 2.3:

we explored training a system to solve the potentially easier proxy task of predicting only which text as a whole is paired with which image and not the exact words of that text. Starting with the same bag-of-words encoding baseline, we swapped the predictive objective for a contrastive objective in Figure 2 and observed a further 4x efficiency improvement in the rate of zero-shot transfer to ImageNet.
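For context, my understanding of the contrastive objective mentioned here is the symmetric cross-entropy over the image-text similarity matrix, as in the paper's pseudocode. A rough PyTorch sketch (variable names are my own, not from this repository):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_features, text_features, temperature=0.07):
    # L2-normalize both embedding sets so the dot product is a cosine similarity
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity matrix: logits[i, j] = sim(image_i, text_j) / temperature
    logits = image_features @ text_features.t() / temperature

    # The i-th image is paired with the i-th text, so the targets are the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over rows (image -> text) and columns (text -> image)
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2
```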

I wonder how this bag-of-words baseline is trained with the transformer. My guess is that one could drop the positional embeddings during training (while keeping them at inference), so that the activations of the last transformer layer at the [EOS] token become context-free and can therefore be interpreted as BoW embeddings. Is this what happens, or are these BoW representations computed differently?
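To make my guess concrete, this is the kind of order-invariant bag-of-words text encoder I am imagining, purely as an illustration (not taken from this repository):

```python
import torch
import torch.nn as nn

class BagOfWordsTextEncoder(nn.Module):
    """Order-invariant text encoder: a caption is represented as the mean of its
    token embeddings, so no positional information or self-attention is used."""

    def __init__(self, vocab_size, embed_dim, output_dim):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.projection = nn.Linear(embed_dim, output_dim)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len), padded with 0
        mask = (token_ids != 0).unsqueeze(-1).float()         # (batch, seq_len, 1)
        embeddings = self.token_embedding(token_ids) * mask   # zero out padding
        # Mean over non-padding tokens: permutation-invariant, i.e. a bag of words
        pooled = embeddings.sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        return self.projection(pooled)
```

Is the BoW baseline something along these lines, or is it still the full transformer with the positional embeddings removed, as I speculated above?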