beichenzbc / Long-CLIP

[ECCV 2024] official code for "Long-CLIP: Unlocking the Long-Text Capability of CLIP"

Fine-tuning Long-CLIP on SPRIGHT - Spatially Right datasets?

zer0int opened this issue

First of all, thank you for making the weights and code of this awesome CLIP publicly available - I really appreciate it!

Your paper mentions that you used 200 text-image pairs of urban scenes with long captions. Have you considered using the SPRIGHT - Spatially Right datasets for further fine-tuning?

The SPRIGHT authors report significant improvements with as few as 500 text-image pairs, although they created and released about 2.3 million images with long captions (except for the LAION subset, which is currently under review; as researchers, though, I am sure you could request access).

I fine-tuned CLIP, including ViT-L/14, on SPRIGHT COCO 40k (captions capped at 77 tokens), and I noticed that the resulting model, much like your models, is far more likely than the original pre-trained CLIP to predict "arrow symbols" for a given image. (The probe here is gradient ascent on text embeddings, optimizing their cosine similarity with the image embedding to surface CLIP's 'opinion' about the image; a sketch of the idea follows the PS below.)

SPRIGHT fine-tune predicting arrow symbols:

[image: arrow-token-prediction]

Your model (-B):

[image: catpizza]

This may be entirely arbitrary and not related to anything meaningful. On the other hand, CLIP often uses emojis and other symbols in meaningful ways, as well as very long "longwords", and these translate to salient concepts, e.g. for guiding Stable Diffusion / SDXL. Current models are easier to prompt than they used to be, but still, "a CLIP knows what a CLIP sees" best. Prompting SDXL with CLIP's strange gradient-ascent-predicted "longwords" reproduces given images remarkably well, once the best tokens / words are selected via attention heatmaps.

So maybe the "arrow token", whether attached to or independent of other tokens, is actually a meaningful emergent representation of spatial relationships, too...?

PS: I have added the gradient ascent script for Long-CLIP to my fork of your repo.
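For reference, here is a minimal sketch of the gradient-ascent "CLIP opinion" probe described above, written against the stock openai `clip` package rather than the Long-CLIP code. The image path, the number of learnable tokens, the learning rate, and the step count are illustrative placeholders, not the exact settings used in my fork's script.

```python
# Minimal sketch: optimize a "soft prompt" so its pooled text embedding matches an
# image embedding, then read off the nearest vocabulary tokens (CLIP's 'opinion').
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model = model.float()                       # keep everything in fp32 for simplicity
for p in model.parameters():
    p.requires_grad_(False)                 # only the soft tokens are optimized

# Fixed target: the (normalized) image embedding.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder path
with torch.no_grad():
    img_feat = model.encode_image(image)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)

# Tokenize a dummy prompt of n_ctx single-token words so the <eot> position is known,
# then make the embeddings at those positions learnable.
n_ctx = 16
tokens = clip.tokenize(" ".join(["x"] * n_ctx)).to(device)
eot_idx = tokens.argmax(dim=-1)             # <eot> has the highest token id
embed = model.token_embedding(tokens).detach()
soft = embed[:, 1:1 + n_ctx].clone().requires_grad_(True)
opt = torch.optim.Adam([soft], lr=0.1)

def encode_from_embeddings(x):
    # Re-run CLIP's text tower directly on an embedding sequence.
    x = x + model.positional_embedding
    x = x.permute(1, 0, 2)                  # NLD -> LND
    x = model.transformer(x)                # causal mask is baked into the blocks
    x = x.permute(1, 0, 2)                  # LND -> NLD
    x = model.ln_final(x)
    return x[torch.arange(x.shape[0]), eot_idx] @ model.text_projection

for step in range(300):
    x = torch.cat([embed[:, :1], soft, embed[:, 1 + n_ctx:]], dim=1)
    txt_feat = encode_from_embeddings(x)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    loss = -(txt_feat * img_feat).sum()     # maximize cosine similarity
    opt.zero_grad()
    loss.backward()
    opt.step()

# Crude readout: snap each optimized embedding to its nearest vocabulary token.
with torch.no_grad():
    vocab = model.token_embedding.weight    # [V, D]
    nearest = (soft[0] @ vocab.T).argmax(dim=-1)
    words = [clip._tokenizer.decoder[int(i)] for i in nearest.tolist()]
print("CLIP 'opinion':", "".join(w.replace("</w>", " ") for w in words))
```

The nearest-token readout at the end is the simplest possible decoding; the actual script can rank or filter candidate tokens differently, e.g. against attention heatmaps as mentioned above.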

Thanks again for your great work!

Thanks for your recognition and your comments, which give us a lot of insight. We may try using the SPRIGHT datasets for further fine-tuning later.

Again, thanks for your discussion.