shizhediao / DaVinci

Source code for the paper "Prefix Language Models are Unified Modal Learners"

Prefix Language Models are Unified Modal Learners

This is the official PyTorch implementation of the ICLR 2023 paper Write and Paint: Generative Vision-Language Models are Unified Modal Learners. The repository supports pre-training on custom datasets, as well as fine-tuning on (1) text understanding, (2) image understanding, (3) text-to-image generation, (4) image-to-text generation, and (5) multi-modal understanding tasks. Our implementation is built on the source code of ALBEF.

Hiring

We are looking for interns / FTEs at ByteDance AI-LAB (in Beijing / Shanghai)! If you are interested in working with us on vision language models, please send your resume to zhangxinsong.0320@bytedance.com.

Requirements:

  • Set up a Python 3 environment and install the dependencies:
pip3 install -r requirements.txt
  • Download the raw images from the corresponding websites
  • Download the json files we provide, which contain image read paths, captions, and/or bbox annotations (a small sanity check for this layout is sketched after the directory tree below)
  • If running the pre-training scripts:
    • install Apex
  • Organize these files as follows:
DaVinci/
    data/
        coco_test.json
        coco_train.json
        coco_val.json
        *.json

    images/
        coco/
            train2014/*.jpg
            val2014/*.jpg
            test2015/*.jpg
        
        visualgenome/
            image/*.jpg
        
        nlvr2/
            images/
                train/0-99/*.png
            dev/*.png
            test1/*.png
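
Before moving on, it can help to confirm that the downloaded images line up with the provided json files. The snippet below is a small sanity check, not part of the repository; the key name 'image' and the assumption that paths are relative to images/ are guesses about the annotation format, so adjust them to the files you actually downloaded.

    import json
    import os

    # Hypothetical sanity check: the "image" key and the images/-relative
    # paths are assumptions about the provided annotation files.
    with open("data/coco_train.json") as f:
        annotations = json.load(f)

    missing = [item["image"] for item in annotations
               if not os.path.exists(os.path.join("images", item["image"]))]
    print(f"{len(missing)} of {len(annotations)} referenced images are missing")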

Pre-training on custom datasets:

  1. Prepare the pre-training data (json files), where each json file contains a list. Each item in the list is a dictionary with two key-value pairs: {'binary': base64_encoding_of_the_image, 'caption': text_of_image}. A construction sketch follows the pre-training command below.
  2. In configs/Pretrain.yaml, set the paths for the json files.
  3. Pre-train the model:
    if [[ ${NUM_WORKER_GPU} -gt 1 ]];
    then
        python3 -m torch.distributed.launch --nproc_per_node=${NUM_WORKER_GPU} \
            --nnodes=${NUM_WORKER} --node_rank=${RANK_ID} --master_addr=${WORKER_0_HOST} --master_port=${WORKER_0_PORT} \
            --use_env Pretrain.py \
            --config ./configs/Pretrain.yaml \
            --output_dir ./outputs/pretrain_coco_vg_${time} \
            --override_cfg "$override_cfg"
    else
        python3 -u Pretrain.py \
            --config ./configs/Pretrain.yaml \
            --output_dir ./outputs/pretrain_coco_vg_${time} \
            --override_cfg "$override_cfg"
    fi
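
As a concrete illustration of the format described in step 1, here is a minimal sketch that builds such a json file. Only the {'binary': ..., 'caption': ...} structure comes from the instructions above (with 'binary' holding the base64-encoded image bytes); the file names and caption are placeholders.

    import base64
    import json

    def encode_image(path):
        # Read the raw image bytes and return their base64 encoding as a string.
        with open(path, "rb") as f:
            return base64.b64encode(f.read()).decode("utf-8")

    # Placeholder image/caption pairs; replace with your own data.
    records = [
        {"binary": encode_image("images/coco/train2014/example.jpg"),
         "caption": "a caption describing the image"},
    ]

    with open("data/custom_pretrain.json", "w") as f:
        json.dump(records, f)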

Multi-Modal Understanding

VQA:

  1. Download VQA v2 dataset and Visual Genome dataset from the original websites.
  2. Download and extract the provided dataset json files.
  3. In configs/VQA.yaml, set the paths for the json files and the image paths.
  4. Finetune the pre-trained checkpoint using 8 A100 GPUs:
python -m torch.distributed.launch --nproc_per_node=8 --use_env VQA.py \
--config ./configs/VQA.yaml \
--output_dir output/vqa \
--checkpoint [Pretrained checkpoint]
  5. Evaluate the result using the official evaluation server.
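
The official VQA v2 evaluation server expects a json list of {"question_id": ..., "answer": ...} entries. The snippet below is only a hedged sketch of packaging predictions into that format; the predictions list and the output path are placeholders rather than something produced by the scripts in this repository.

    import json

    # Hypothetical (question_id, answer) pairs from your fine-tuned model.
    predictions = [(458752000, "yes"), (262148001, "2")]

    results = [{"question_id": qid, "answer": ans} for qid, ans in predictions]
    with open("output/vqa/vqa_submit.json", "w") as f:
        json.dump(results, f)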

Visual Entailment:

  1. Download SNLI-VE dataset from the original website.
  2. Download and extract the provided dataset json files.
  3. In configs/VE.yaml, set the paths for the json files and the image path.
  4. Finetune the pre-trained checkpoint using 8 A100 GPUs:
python -m torch.distributed.launch --nproc_per_node=8 --use_env VE.py \
--config ./configs/VE.yaml \
--output_dir output/VE \
--checkpoint [Pretrained checkpoint]

NLVR2:

  1. Download NLVR2 dataset from the original website.
  2. Download and extract the provided dataset json files.
  3. In configs/NLVR.yaml, set the paths for the json files and the image path.
  4. Finetune the pre-trained checkpoint using 8 A100 GPUs:
python -m torch.distributed.launch --nproc_per_node=8 --use_env NLVR.py \
--config ./configs/NLVR.yaml \
--output_dir output/NLVR \
--checkpoint [Pretrained checkpoint]

Image-to-Text Generation (COCO Caption):

  1. Download MSCOCO dataset from the original website.
  2. Download and extract the provided dataset json files.
  3. In configs/gen_coco.yaml, set the paths for the json files and the image path.
  4. Finetune the pre-trained checkpoint using 8 A100 GPUs:
python -m torch.distributed.launch --nproc_per_node=8 --use_env gen_coco.py \
--config ./configs/gen_coco.yaml \
--output_dir output/gen_coco \
--checkpoint [Pretrained checkpoint]
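
For a quick offline score, one common option is the COCO caption toolkit (pycocotools and pycocoevalcap, installed separately; neither is a dependency of this repository). The sketch assumes your generated captions were saved as a COCO-style results file of {"image_id": ..., "caption": ...} entries and that a COCO-format ground-truth annotation file is available; both paths below are hypothetical.

    from pycocotools.coco import COCO
    from pycocoevalcap.eval import COCOEvalCap

    # Hypothetical paths: COCO-format ground truth and a COCO-style results file.
    coco_gt = COCO("data/coco_caption_test_gt.json")
    coco_res = coco_gt.loadRes("output/gen_coco/captions_results.json")

    evaluator = COCOEvalCap(coco_gt, coco_res)
    evaluator.params["image_id"] = coco_res.getImgIds()  # score only the generated images
    evaluator.evaluate()
    print(evaluator.eval)  # BLEU, METEOR, ROUGE-L, CIDEr, ...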

Text-to-Image Generation:

  1. Download MSCOCO dataset from the original website.
  2. Download and extract the provided dataset json files.
  3. In configs/image_sampling.yaml, set the paths for the json files and the image path.
  4. Directly generate the images:
python -m torch.distributed.launch --nproc_per_node=8 \
    --use_env image_sampling.py \
    --config ./configs/image_sampling.yaml \
    --output_dir output/image_sampling \
    --checkpoint [Pretrained checkpoint]

Text Understanding:

All GLUE datasets are provided by the Hugging Face Datasets library, so you do not need to download them. Fine-tuning using 1 A100 GPU:

 python glue.py \
  --model_name_or_path [Pretrained checkpoint] \
  --task_name mrpc \
  --max_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_warmup_steps 50 \
  --num_train_epochs 8 \
  --output_dir output/mrpc
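
To preview what glue.py fine-tunes on, you can load the same task directly from Hugging Face Datasets; "mrpc" here matches the --task_name argument above.

    from datasets import load_dataset

    # MRPC is fetched automatically; no manual download is needed.
    mrpc = load_dataset("glue", "mrpc")
    print(mrpc)              # DatasetDict with train / validation / test splits
    print(mrpc["train"][0])  # sentence1, sentence2, label, idx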

For distributed training with multiple GPUs or nodes, please first set up the Hugging Face Accelerate library following its official instructions. Then, you can launch distributed training with:

 accelerate launch glue.py \
  --model_name_or_path [Pretrained checkpoint] \
  --task_name mrpc \
  --max_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_warmup_steps 50 \
  --num_train_epochs 8 \
  --output_dir output/mrpc

Image Understanding

All image understanding datasets are provided by torchvision, so you do not need to download them. Fine-tuning on 8 A100 GPUs:

python image_linprobe.py \
    --pretrained [Pretrained checkpoint] \
    --dist-url 'tcp://localhost:10001' --multiprocessing-distributed --world-size 1 --rank 0 \
    --override_cfg "dataset:imagenet;optimizer: {opt: adamW, lr: 1e-4, weight_decay: 0.01}"
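
The --override_cfg string packs config overrides as semicolon-separated key/value pairs whose values read like YAML fragments. The parser below is purely illustrative, written from the example string above; it is not the repository's actual implementation, and the real parsing rules may differ.

    import yaml

    def parse_override(override_cfg):
        # Illustrative only: split on ";" and treat everything after the first
        # ":" in each piece as a YAML fragment, so nested values such as
        # "{opt: adamW, lr: 1e-4, weight_decay: 0.01}" become dictionaries.
        updates = {}
        for piece in override_cfg.split(";"):
            if not piece.strip():
                continue
            key, _, value = piece.partition(":")
            updates[key.strip()] = yaml.safe_load(value)
        return updates

    print(parse_override("dataset:imagenet;optimizer: {opt: adamW, lr: 1e-4, weight_decay: 0.01}"))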

Citation

If you use or extend our work, please consider citing our paper:

@inproceedings{diao2023write,
  title={Write and Paint: Generative Vision-Language Models are Unified Modal Learners},
  author={Diao, Shizhe and Zhou, Wangchunshu and Zhang, Xinsong and Wang, Jiawei},
  booktitle={The Eleventh International Conference on Learning Representations},
  year={2023}
}

License

BSD 3-Clause "New" or "Revised" License

