beichenzbc / Long-CLIP

[ECCV 2024] official code for "Long-CLIP: Unlocking the Long-Text Capability of CLIP"

Fine-tuning Long-CLIP on SPRIGHT - Spatially Right datasets?

zer0int opened this issue

First of all, thank you for making the weights and code of this awesome CLIP publicly available - I really appreciate it!

Your paper mentions that you used 200 text-image pairs of urban scenes with long captions. Have you considered using the SPRIGHT - Spatially Right datasets for further fine-tuning?

The SPRIGHT authors report significant improvements with as few as 500 text-image pairs, although they created and released about 2.3 million images with long captions (except for the LAION subset, which is currently under review; as researchers, though, I am sure you could request access).

I fine-tuned CLIP, including ViT-L/14, on SPRIGHT COCO 40k (captions capped at 77 tokens), and I noticed that the resulting model, much like your models, is far more likely than the original pre-trained CLIP to predict "arrow symbols" for a given image. (The probe here is gradient ascent on text embeddings, optimizing their cosine similarity with the image embedding to surface CLIP's 'opinion' about the image; a sketch of the idea follows the PS below.)

SPRIGHT fine-tune predicting arrow symbols:

[image: arrow-token-prediction]

Your model (-B):

[image: catpizza]

This may be entirely arbitrary and not related to anything meaningful. On the other hand, CLIP often uses emojis and other symbols in meaningful ways, as well as very long "longwords", and these translate to salient concepts, e.g. for guiding Stable Diffusion / SDXL. Current models are easier to prompt than they used to be, but still, "a CLIP knows what a CLIP sees" best. Prompting SDXL with CLIP's strange gradient-ascent-predicted "longwords" reproduces given images remarkably well, once the best tokens / words are selected via attention heatmaps.

So maybe the "arrow token", whether attached to or independent of other tokens, is actually a meaningful emergent representation of spatial relationships, too...?

PS: I have added the gradient ascent script for Long-CLIP to my fork of your repo.
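For reference, here is a minimal sketch of the gradient-ascent "CLIP opinion" probe described above, written against the stock openai `clip` package rather than the Long-CLIP code. The image path, the number of learnable tokens, the learning rate, and the step count are illustrative placeholders, not the exact settings used in my fork's script.

```python
# Minimal sketch: optimize a "soft prompt" so its pooled text embedding matches an
# image embedding, then read off the nearest vocabulary tokens (CLIP's 'opinion').
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model = model.float()                       # keep everything in fp32 for simplicity
for p in model.parameters():
    p.requires_grad_(False)                 # only the soft tokens are optimized

# Fixed target: the (normalized) image embedding.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder path
with torch.no_grad():
    img_feat = model.encode_image(image)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)

# Tokenize a dummy prompt of n_ctx single-token words so the <eot> position is known,
# then make the embeddings at those positions learnable.
n_ctx = 16
tokens = clip.tokenize(" ".join(["x"] * n_ctx)).to(device)
eot_idx = tokens.argmax(dim=-1)             # <eot> has the highest token id
embed = model.token_embedding(tokens).detach()
soft = embed[:, 1:1 + n_ctx].clone().requires_grad_(True)
opt = torch.optim.Adam([soft], lr=0.1)

def encode_from_embeddings(x):
    # Re-run CLIP's text tower directly on an embedding sequence.
    x = x + model.positional_embedding
    x = x.permute(1, 0, 2)                  # NLD -> LND
    x = model.transformer(x)                # causal mask is baked into the blocks
    x = x.permute(1, 0, 2)                  # LND -> NLD
    x = model.ln_final(x)
    return x[torch.arange(x.shape[0]), eot_idx] @ model.text_projection

for step in range(300):
    x = torch.cat([embed[:, :1], soft, embed[:, 1 + n_ctx:]], dim=1)
    txt_feat = encode_from_embeddings(x)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    loss = -(txt_feat * img_feat).sum()     # maximize cosine similarity
    opt.zero_grad()
    loss.backward()
    opt.step()

# Crude readout: snap each optimized embedding to its nearest vocabulary token.
with torch.no_grad():
    vocab = model.token_embedding.weight    # [V, D]
    nearest = (soft[0] @ vocab.T).argmax(dim=-1)
    words = [clip._tokenizer.decoder[int(i)] for i in nearest.tolist()]
print("CLIP 'opinion':", "".join(w.replace("</w>", " ") for w in words))
```

The nearest-token readout at the end is the simplest possible decoding; the actual script can rank or filter candidate tokens differently, e.g. against attention heatmaps as mentioned above.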

Thanks again for your great work!

Thanks for your recognition and your comments, which give us a lot of insight. We may try using the SPRIGHT datasets for further fine-tuning later.

Again, thanks for your discussion.