beichenzbc / Long-CLIP

[ECCV 2024] official code for "Long-CLIP: Unlocking the Long-Text Capability of CLIP"

Geometric Parameterization (GmP) of the MLP + fine-tuning on COCO-40k-SPRIGHT (spatially-right long labels) eliminates the typographic attack vulnerability in Long-CLIP (but not in short-CLIP, 77 tokens) and improves ImageNet/ObjectNet accuracy.

zer0int opened this issue

Dear researchers,

I just wanted to let you know about some findings I made with your amazing Long-CLIP model. While ViT-L/14 (77 tokens) also shows partial mitigation of the typographic attack vulnerability when GmP-fine-tuned on SPRIGHT-COCO, Long-CLIP appears to show full mitigation of the typographic attack vulnerability when fine-tuned with GmP on the full long labels of COCO-SPRIGHT.
Exception: a residual vulnerability with non-English text, most likely because such text is underrepresented in the pre-training data.

For the classic OpenAI examples of "apple / iPod" and "piggy bank / poodle", however, the mitigation holds:

[Image: apple-ipod-demo]

[Image: poodle-demo]
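For anyone who wants to experiment, here is a minimal sketch of the geometric (magnitude + direction) reparameterization idea applied to a Linear layer, and of converting a pretrained model's MLP projections accordingly. This is an illustration of the concept, not the exact GmP code from my fork; the `c_fc` / `c_proj` naming assumes OpenAI-CLIP-style transformer blocks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricLinear(nn.Module):
    """Sketch of a geometrically parameterized Linear layer: each weight row
    is stored as a magnitude r and a direction theta, and reassembled as
    w = r * theta / ||theta|| on every forward pass."""

    def __init__(self, in_features, out_features, bias=True):
        super().__init__()
        init = torch.empty(out_features, in_features)
        nn.init.kaiming_uniform_(init, a=5 ** 0.5)
        self.r = nn.Parameter(init.norm(dim=1, keepdim=True))   # (out, 1)
        self.theta = nn.Parameter(F.normalize(init, dim=1))     # (out, in)
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    def forward(self, x):
        weight = self.r * F.normalize(self.theta, dim=1)
        return F.linear(x, weight, self.bias)

def gmp_convert_mlp(module):
    """Recursively swap the MLP projections (named c_fc / c_proj in
    OpenAI-CLIP-style blocks) for GeometricLinear, copying pretrained weights."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear) and name in ("c_fc", "c_proj"):
            gmp = GeometricLinear(child.in_features, child.out_features,
                                  bias=child.bias is not None)
            with torch.no_grad():
                gmp.r.copy_(child.weight.norm(dim=1, keepdim=True))
                gmp.theta.copy_(F.normalize(child.weight, dim=1))
                if child.bias is not None:
                    gmp.bias.copy_(child.bias)
            setattr(module, name, gmp)
        else:
            gmp_convert_mlp(child)
```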

Accuracy on ImageNet/ObjectNet (MVT) also improved with the GmP fine-tune (the screenshot below shows accuracy scores for the original LongCLIP-L, GmP-LongCLIP after epochs 1-2, and GmP-LongCLIP after the final epoch). Training took 10 epochs, 3-4 hours on a single RTX 4090 at batch_size 34. I would assume results could be even better on a multi-GPU setup with a more typical batch size!

[Image: logit-good-clip-my-long]
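For reference, a minimal sketch of the kind of zero-shot accuracy evaluation behind these numbers, following the `longclip.load` / `longclip.tokenize` usage shown in this repo's README. The checkpoint path, class names, prompt template, and dataloader are placeholders.

```python
import torch
from model import longclip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = longclip.load("./checkpoints/longclip-L.pt", device=device)
model.eval()

class_names = ["tench", "goldfish", "great white shark"]   # ...full label set here
prompts = [f"a photo of a {c}" for c in class_names]

with torch.no_grad():
    text_feat = model.encode_text(longclip.tokenize(prompts).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

@torch.no_grad()
def zero_shot_accuracy(loader):
    """`loader` yields (preprocessed image batch, integer label batch)."""
    correct = total = 0
    for images, labels in loader:
        img_feat = model.encode_image(images.to(device))
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        pred = (img_feat @ text_feat.T).argmax(dim=-1).cpu()
        correct += (pred == labels).sum().item()
        total += labels.numel()
    return correct / total
```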

You can find the full details and code for reproduction in my fork of your repo. Kind regards!

Thanks for your amazing findings! Your research passion and professionalism really impress us. We're glad that our work can contribute to the community.

Update: By scaling the activation value of an "adverb neuron" [1] in the ViT by a factor of 1000 during fine-tuning, Long-CLIP adjusted and found a "better minimum", further boosting accuracy (same dataset as before, COCO-SPRIGHT-40k):

[Image: LongCLIP-eval-2]

Compare to OpenAI ViT-L/14, trained in the exact same manner (hyperparameters, epochs, activation manipulation), but with short labels:

[Image: eval-clip-gpt4-compare]

As anticipated, your model already outperformed "short-CLIP" (77 tokens) for many classes, and for some of those my fine-tune could not improve it further, whereas I was able to boost OpenAI/CLIP's performance for all classes.

[1] Adverb neuron: a feature (Layer 22, Feature 2432) that, when its activation value is scaled x1000, leads CLIP (via gradient ascent, i.e. optimizing text embeddings for cosine similarity with an image embedding) to describe images using "visually meaningless" adverbs. It is potentially associated with the model's "text obsession" (the typographic attack vulnerability). Scaling this activation x1000 initially leads to exploding gradients; however, the model is surprisingly robust, compensates during later epochs, and eventually finds a solution that appears superior.

[Image: activation-scaling-longclip-ft]
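A minimal sketch of how such single-neuron scaling can be applied during fine-tuning with a PyTorch forward hook. The layer and feature indices come from the footnote above; the module path (`visual.transformer.resblocks[...].mlp.c_fc`) assumes OpenAI-CLIP-style naming, and `model` is a loaded CLIP/Long-CLIP instance.

```python
import torch

SCALE = 1000.0
LAYER_IDX = 22      # Layer 22 (vision transformer block index)
FEATURE_IDX = 2432  # Feature 2432 (hidden unit of that block's MLP)

def scale_adverb_neuron(module, inputs, output):
    """Forward hook on the MLP's first projection (c_fc): multiply a single
    hidden feature's activation by SCALE and return the modified output."""
    output = output.clone()
    output[..., FEATURE_IDX] = output[..., FEATURE_IDX] * SCALE
    return output

# Module path is an assumption based on OpenAI-CLIP-style naming.
target = model.visual.transformer.resblocks[LAYER_IDX].mlp.c_fc
handle = target.register_forward_hook(scale_adverb_neuron)

# ... run the GmP fine-tune as usual; the hook stays active on every forward pass ...
# handle.remove()   # detach the hook when done
```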

Gradient ascent: the erratic "CLIP opinion" about an image when the "adverb neuron" is amplified:

[Image: adverb-neuron-compared]
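For completeness, a simplified sketch of the gradient-ascent probe referred to above: free "soft token" embeddings are optimized to maximize cosine similarity with an image embedding, then read back as their nearest vocabulary tokens. The last-position pooling, step count, and learning rate are simplifications, not the exact script behind the figures; it assumes `model` has been cast to fp32 (`model.float()`) and `image` is an already-preprocessed tensor.

```python
import torch
import torch.nn.functional as F

device = next(model.parameters()).device
for p in model.parameters():
    p.requires_grad_(False)

with torch.no_grad():
    img_feat = F.normalize(model.encode_image(image.unsqueeze(0).to(device)), dim=-1)

ctx_len, width = model.positional_embedding.shape
# Free "soft token" embeddings for the full context length (keeps the causal
# attention mask happy); small init for stability.
soft = (0.01 * torch.randn(1, ctx_len, width, device=device)).requires_grad_(True)
opt = torch.optim.Adam([soft], lr=0.1)

for step in range(200):
    x = soft + model.positional_embedding                     # (1, ctx, width)
    x = model.transformer(x.permute(1, 0, 2)).permute(1, 0, 2)
    x = model.ln_final(x)
    txt_feat = F.normalize(x[:, -1] @ model.text_projection, dim=-1)  # pool last position
    loss = -(img_feat * txt_feat).sum()                        # ascent on cosine similarity
    opt.zero_grad()
    loss.backward()
    opt.step()

# Read the optimized embeddings back as their nearest vocabulary tokens.
tok_emb = F.normalize(model.token_embedding.weight, dim=-1)    # (vocab, width)
nearest_ids = (F.normalize(soft[0].detach(), dim=-1) @ tok_emb.T).argmax(dim=-1)
```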

No need to respond, I know you're probably busy - just wanted to share a potentially interesting / relevant new result. =)