[NeurIPS 2023] Improving CLIP Training with Language Rewrites

This repo contains text data, code and pre-trained models for paper Improving CLIP Training with Language Rewrites. If you find the data, models or code useful, please consider citing our paper:

@inproceedings{fan2023improving,
  title={Improving CLIP Training with Language Rewrites},
  author={Fan, Lijie and Krishnan, Dilip and Isola, Phillip and Katabi, Dina and Tian, Yonglong},
  booktitle={NeurIPS},
  year={2023}
}

Overview:

We propose Language augmented CLIP (LaCLIP). LaCLIP enhances CLIP training by rewriting text descriptions with large language models. Key steps:

Meta-Input-Output Generation: we explored different strategies for generating meta-input-output pairs that can be used as examples in the prompt context for LLaMA in-context learning, namely ChatGPT, Bard, MSCOCO and Human. Examples of generating such pairs with ChatGPT:

In-Context Learning with LLaMA: Utilizing the constructed context input as a prompt, LLaMA exhibits its ability to perform text completion and generate rewritten versions of the corresponding text samples. This process is conducted for each text sample present in the pre-training image-text dataset. Example of LLaMA rewriting a text sample:

Pre-trained Models

Dataset	Method	Zero-Shot	Checkpoint
CC3M	CLIP	15.8	ViT-B/16
CC3M	LaCLIP	21.5	ViT-B/16
CC12M	CLIP	40.2	ViT-B/16
CC12M	LaCLIP	48.4	ViT-B/16
RedCaps	CLIP	42.9	ViT-B/16
RedCaps	LaCLIP	46.2	ViT-B/16
LAION-400M	CLIP	62.0	ViT-B/32
LAION-400M	LaCLIP	64.4	ViT-B/32
LAION-400M	CLIP	67.0	ViT-B/16
LAION-400M	LaCLIP	69.3	ViT-B/16
LAION-400M	CLIP	71.8	ViT-L/14
LAION-400M	LaCLIP	74.5	ViT-L/14

Code Overview

Code for generating rewrites of text samples
4 versions of augmented text on 3 datasets (CC3M, CC12M, RedCaps)
Pre-trained models with LaCLIP and vanilla CLIP
Zero-shot evaluation code on ImageNet
Code for training LaCLIP

Dependencies

PyTorch 1.11.0
torchvision 0.12.0
timm 0.5.4
open_clip (optional, for LAION-400M models)
LLaMA (for generating rewrites)

Pre-computed Augmented Texts

Original is the original caption associated with each image.
ChatGPT/Bard/MSCOCO/Human is the text generated by LLaMA ICL with the ChatGPT/Bard/MSCOCO/Human Meta-Input-Output pairs as in-context learning examples.

Dataset	Original	ChatGPT	Bard	MSCOCO	Human
CC3M	Link	Link	Link	Link	Link
CC12M	Link	Link	Link	Link	Link
RedCaps	Link	Link	Link	Link	Link

Rewrite for other datasets

In order to rewrite for other datasets of your own interest, we put the code for rewriting in the rewrite folder. Please refer to Meta LLaMA page for detailed instructions for model access and environment setup. The rewritten text could be generated by running the following command:

export LLAMA_FOLDER=/PATH/TO/LLAMA/WEIGHTS
export PYTHONPATH=/PATH/TO/LLAMA/
export model='7b'
torchrun --nproc_per_node 1 --master_port 12388 \
    llama_rewrite.py --ckpt_dir ${LLAMA_FOLDER}/${model} --tokenizer_path ${LLAMA_FOLDER}/${model}/tokenizer.model \
    --max_batch_size 100 --max_seq_len 400 --prompt_filename text/source.txt --output_filename text/target.txt --sample_mode chatgpt --temperature 0.9

Main Arguments

--prompt_filename: text file to be rewritten, each line is one sentence
--output_filename: output path
--sample_mode: sample mode for in-context learning (chatgpt, bard, mscoco, or human)
--temperature: temperature for sampling, higher temperature leads to more diverse text

Zero-shot Evaluation on ImageNet

To perform zero-shot evaluation on ImageNet, use the following command:

For CC3M, CC12M and RedCaps models:

python eval_zeroshot_imagenet.py --imagenet-root [PATH_TO_IMAGENET] --ckpt-path [PATH_TO_CHECKPOINT] --model CLIP_VITB16 --batch-size 128 --workers 8

For LAION-400M models:

python eval_zeroshot_imagenet_laion.py --imagenet-root [PATH_TO_IMAGENET] --ckpt-path [PATH_TO_CHECKPOINT] --model [ViT-B-32, ViT-B-16 or ViT-L-14] --batch-size 128 --workers 8

add --quickgelu for ViT-L-14 models.

Training LaCLIP

To train LaCLIP, use the following command:

torchrun --nproc_per_node=GPU_PER_NODE --nnodes=NUM_NODE --node_rank=NODE_RANK \
  --master_addr=MASTER_NODE --master_port=PORT \
  train.py \
    --train-data PATH/TO/TRAINING/CSV \
    --root PATH/TO/TRAINING/IMAGE/ROOT \
    --imagenet-root PATH/TO/IMAGENET/ROOT \
    --aug-text --augmented_caption_filelist PATH/TO/AUGMENTED/CAPTION/FILES  \
    --output-dir PATH/TO/OUTPUT \
    --model CLIP_VITB16 \
    --lr 1e-3 --wd 0.5  --warmup-epochs 1 --batch-size 256 --epochs 35

Main Arguments

--train-data: csv file for training data, each line is one image-text pair, with the relative image path and original caption separated by a comma
--root: root dir for images
--imagenet-root: root dir for ImageNet, used for zero-shot evaluation
--aug-text: whether to use augmented text
--augmented_caption_filelist: text files for augmented text, each line is one sentence, the order of the sentences should be the same as the order of the images in --train-data. Seperate the augmented text files with a space for multiple augmented text files.
--output-dir: saving dir for logs and checkpoints
--model: CLIP backbone architecture

Additional Notes

Make sure the sample order in the --augmented_caption_filelist is the same as the order in --train-data.
Please refer to Table A3 in the paper for the hyperparameters used for each dataset.
To train vanilla CLIP, comment out the --aug-text and --augmented_caption_filelist arguments.

kiminh / LaCLIP

[NeurIPS 2023] Improving CLIP Training with Language Rewrites

Overview:

Pre-trained Models

Code Overview

Dependencies

Pre-computed Augmented Texts

Rewrite for other datasets

Main Arguments

Zero-shot Evaluation on ImageNet

Training LaCLIP

Main Arguments

Additional Notes

About

Languages