Pic2Word (CVPR2023)

This is an open-source implementation of Pic2Word. This is not an officially supported Google product.

Data

Training Data

We train the model on images downloaded from the Conceptual Captions URLs. See open_clip for how to obtain the dataset.

The training data directories must be in the root of this repo and structured as follows.

  cc_data
    ├── train ## training image directories.
    └── val ## validation image directories.
  cc
    ├── Train_GCC-training_output.csv ## training data list
    └── Validation_GCC-1.1.0-Validation_output.csv ## validation data list
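
Before launching a long training run, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of this repo: the helper name `missing_paths` is hypothetical, and it only checks that the four paths listed above exist, not that their contents are valid.

```python
from pathlib import Path

# Paths copied from the layout above, relative to the repository root.
EXPECTED_PATHS = [
    "cc_data/train",
    "cc_data/val",
    "cc/Train_GCC-training_output.csv",
    "cc/Validation_GCC-1.1.0-Validation_output.csv",
]


def missing_paths(repo_root="."):
    """Return the expected paths that do not exist under repo_root.

    Hypothetical helper for illustration; it checks existence only.
    """
    root = Path(repo_root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]
```

Calling `missing_paths()` from the repository root returns an empty list when all four paths are present.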

Test Data

See README to prepare the test datasets.

Training

Install dependencies

See open_clip for installation details. The same environment should work for this repo. setenv.sh is the script we used to set up the environment in a virtualenv.

Also run the commands below to activate the virtualenv and add the src directory to PYTHONPATH:

. env3/bin/activate
export PYTHONPATH="$PYTHONPATH:$PWD/src"
export PYTHONWARNINGS='ignore:semaphore_tracker:UserWarning'

Pre-trained model

The pre-trained model is available on Google Drive.

Sample command for training:

python -u src/main.py \
    --save-frequency 1 \
    --train-data="cc/Train_GCC-training_output.csv"  \
    --warmup 10000 \
    --batch-size=128 \
    --lr=1e-4 \
    --wd=0.1 \
    --epochs=30 \
    --workers=8 \
    --openai-pretrained \
    --model ViT-L/14

Sample evaluation only:

Evaluation on COCO, ImageNet, or CIRR.

# Set data_name to coco, imgnet, or cirr.
python src/eval_retrieval.py \
    --openai-pretrained \
    --resume /path/to/checkpoints \
    --eval-mode $data_name \
    --gpu $gpu_id \
    --model ViT-L/14

Evaluation on Fashion-IQ (shirt, dress, or toptee).

# Set cloth_type to shirt, dress, or toptee.
python src/eval_retrieval.py \
    --openai-pretrained \
    --resume /path/to/checkpoints \
    --eval-mode fashion \
    --source $cloth_type \
    --gpu $gpu_id \
    --model ViT-L/14
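
To evaluate all three Fashion-IQ categories in one go, the command can be wrapped in a small driver. This is a sketch under stated assumptions: the helpers `build_eval_command` and `run_all` are hypothetical and not part of this repo, and the flags are exactly those of the evaluation command for Fashion-IQ in this section.

```python
import subprocess

# The three Fashion-IQ categories accepted by --source.
CLOTH_TYPES = ["shirt", "dress", "toptee"]


def build_eval_command(cloth_type, checkpoint, gpu_id=0):
    """Build the Fashion-IQ evaluation command as an argument list.

    Hypothetical helper; it only assembles the flags documented here.
    """
    return [
        "python", "src/eval_retrieval.py",
        "--openai-pretrained",
        "--resume", checkpoint,
        "--eval-mode", "fashion",
        "--source", cloth_type,
        "--gpu", str(gpu_id),
        "--model", "ViT-L/14",
    ]


def run_all(checkpoint, gpu_id=0):
    """Run the evaluation once per category, stopping on the first failure."""
    for cloth_type in CLOTH_TYPES:
        subprocess.run(build_eval_command(cloth_type, checkpoint, gpu_id), check=True)
```

Passing the command as a list avoids shell quoting problems; `run_all("/path/to/checkpoints")` should be called from the repository root with the environment activated.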

Demo:

Run the demo on COCO, ImageNet, CIRR, or Fashion-IQ (dress, shirt, or toptee).

# --retrieval-data: choose from coco, imgnet, cirr, dress, shirt, toptee.
# --query_file: comma-separated paths of the query images.
# --prompts: comma-separated prompts. Use * to mark the token to be replaced with an image token, e.g., "a sketch of *".
# --demo-out: directory in which the html file and image directory are generated.
python src/demo.py \
    --openai-pretrained \
    --resume /path/to/checkpoints \
    --retrieval-data $data_name \
    --query_file "path_img1,path_img2,path_img3..." \
    --prompts "prompt1,prompt2,..." \
    --demo-out $path_demo \
    --gpu $gpu_id \
    --model ViT-L/14

The demo generates a directory containing an HTML file and an image directory. Download the directory and open the HTML file to see the results.
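
Because the demo takes query images and prompts as comma-separated strings, it can be convenient to build those strings from Python lists. The sketch below is illustrative only: `format_demo_args` is a hypothetical helper, and the checks assume that no single path or prompt may contain a comma and that every prompt should include the `*` placeholder. Neither assumption is confirmed by the demo script itself.

```python
def format_demo_args(query_files, prompts):
    """Join query image paths and prompts into comma-separated strings.

    Hypothetical helper. Assumes items must not contain commas and that
    each prompt marks the image token position with '*'.
    """
    for item in list(query_files) + list(prompts):
        if "," in item:
            raise ValueError(f"item contains a comma: {item!r}")
    for prompt in prompts:
        if "*" not in prompt:
            raise ValueError(f"prompt has no '*' placeholder: {prompt!r}")
    return ",".join(query_files), ",".join(prompts)
```

The two returned strings can be passed as the values of `--query_file` and `--prompts`, respectively.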

Citing

If you find this repository useful, please consider citing:

@article{saito2023pic2word,
  title={Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval},
  author={Saito, Kuniaki and Sohn, Kihyuk and Zhang, Xiang and Li, Chun-Liang and Lee, Chen-Yu and Saenko, Kate and Pfister, Tomas},
  journal={CVPR},
  year={2023}
}
