BridgeTower

This repo is the official PyTorch implementation of "BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning".

Updates

Feb. 2023: BridgeTower was integrated into Hugging Face Transformers.
- Model Hub, code, and documentation are available.
- Thanks to Anahita Bhiwandiwalla, Tiep Le and Shaoyen Tseng from Intel Labs for their great work!
Nov. 2022: BridgeTower was accepted by AAAI'23. Code and checkpoints are released.
Jun. 2022: We released the preprint on arXiv.
May 2022: BridgeTower (single model, 4M data) achieved 78.73% and 81.15% (base and large) on the VQAv2 Challenge test-std set.

Abstract

Vision-Language (VL) models with the Two-Tower architecture have dominated visual-language representation learning in recent years. Current VL models either use lightweight uni-modal encoders and learn to extract, align and fuse both modalities simultaneously in a deep cross-modal encoder, or feed the last-layer uni-modal representations from the deep pre-trained uni-modal encoders into the top cross-modal encoder. Both approaches potentially restrict vision-language representation learning and limit model performance. In this paper, we propose BridgeTower, which introduces multiple bridge layers that build a connection between the top layers of uni-modal encoders and each layer of the cross-modal encoder. This enables effective bottom-up cross-modal alignment and fusion between visual and textual representations of different semantic levels of pre-trained uni-modal encoders in the cross-modal encoder. Pre-trained with only 4M images, BridgeTower achieves state-of-the-art performance on various downstream vision-language tasks. In particular, on the VQAv2 test-std set, BridgeTower achieves an accuracy of 78.73%, outperforming the previous state-of-the-art model METER by 1.09% with the same pre-training data and almost negligible additional parameters and computational costs. Notably, when further scaling the model, BridgeTower achieves an accuracy of 81.15%, surpassing models that are pre-trained on orders-of-magnitude larger datasets.

Architecture

Main Results

Deployment

Run setup.sh to set up the environment.
[Optional] We use wandb to track experiments. Remember to run wandb login and paste your token before running the script.

Dataset Preparation

We follow ViLT and use pyarrow to serialize the datasets. See here for details.
For the SNLI-VE dataset, we follow here.
For the VG-QA dataset, in addition to the image-text pairs in VG obtained from here, the image metadata, question-answer data, and COCO split information also need to be downloaded.
The final file structure of the datasets is shown in setup.sh.

Checkpoints

Pre-trained checkpoints on 4M data: BASE and LARGE
Fine-tuned checkpoints for
- Visual Question Answering on VQAv2: BASE, BASE(w/ VGQA), LARGE, LARGE(w/ VGQA)
- Image-Text Retrieval on Flickr30k: BASE
- Visual Entailment on SNLI-VE: BASE
- Visual Reasoning on NLVR$^2$: BASE
- Image-Text Retrieval on MSCOCO: BASE

Here is an example for downloading a checkpoint.

# download azcopy
wget https://aka.ms/downloadazcopy-v10-linux
tar -xvf downloadazcopy-v10-linux
sudo cp ./azcopy_linux_amd64_*/azcopy /usr/bin/
sudo chmod 755 /usr/bin/azcopy
# azcopy copy [remote path] [local path]
azcopy copy "https://chenfei.blob.core.windows.net/data/G/LCI/best_checkpoints/BridgeTower_pt_base.ckpt?sv=2020-10-02&st=2022-11-24T12%3A18%3A49Z&se=2027-11-25T12%3A18%3A00Z&sr=b&sp=r&sig=BJigddAMHfNUtQuTGH8bJUrzAO3LfaeSm48AXUqZngY%3D" "./BridgeTower_pt_base.ckpt"
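
The download step above can be wrapped in a small reusable function when several checkpoints are needed. This is a minimal sketch, not part of the repo: the function name fetch_ckpt and the AZCOPY_BIN override variable are hypothetical, and the URL argument must be the full link (including its access token) for the checkpoint you want.

```shell
# fetch_ckpt: hypothetical helper, not provided by this repo.
# Usage: fetch_ckpt "<full checkpoint URL>" "<local path>"
fetch_ckpt() {
  ckpt_url="${1:-}"
  ckpt_dest="${2:-}"
  # AZCOPY_BIN is an assumed override hook; it defaults to the azcopy
  # binary installed by the commands above.
  azcopy_bin="${AZCOPY_BIN:-azcopy}"

  # Refuse to run without both arguments.
  if [ -z "$ckpt_url" ] || [ -z "$ckpt_dest" ]; then
    echo "usage: fetch_ckpt <url> <local path>" >&2
    return 2
  fi

  # Skip the transfer when the destination file already exists.
  if [ -f "$ckpt_dest" ]; then
    echo "already downloaded: $ckpt_dest"
    return 0
  fi

  "$azcopy_bin" copy "$ckpt_url" "$ckpt_dest"
}
```

Note that the existence check does not verify file integrity, so delete a partially downloaded file before retrying.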

Pre-training on Image-Text Datasets

# Pre-train BridgeTower Base Model
bash scripts/pre_train.sh
# Pre-train BridgeTower Large Model
bash scripts/pre_train_large.sh

Fine-tuning on Downstream VL Tasks

To evaluate on VQAv2, submit the JSON file in the logs/ directory to the eval.ai evaluation server to obtain the test-dev and/or test-std scores.

# Base Model on VQAv2 without VLP
bash scripts/ftfs_base_vqa.sh

# Large Model on VQAv2 without VLP
bash scripts/ftfs_large_vqa.sh

# Base Model on VQAv2 with VLP
bash scripts/ftfpt_base_vqa.sh

# Large Model on VQAv2 with VLP
bash scripts/ftfpt_large_vqa.sh

# Base Model on IRTR-Flickr30K with VLP (directly use ITM with multiple false texts)
bash scripts/ftfpt_base_irtr_f30k.sh

# Base Model on IRTR-Flickr30K with VLP (follow ALBEF to use ITC to sample hard negatives for ITM)
bash scripts/ftfpt_base_irtr_itm_itc_f30k.sh

# Base Model on SNLI-VE with VLP
bash scripts/ftfpt_base_snlive.sh

# Base Model on NLVR^2 with VLP
bash scripts/ftfpt_base_nlvr2.sh

# Base Model on IRTR-MSCOCO with VLP (follow ALBEF to use ITC to sample hard negatives for ITM)
bash scripts/ftfpt_base_irtr_itm_itc_coco.sh
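
For the VQAv2 submission step, it can help to locate the most recent JSON file under logs/ after a run finishes. The sketch below is an illustration only: the helper name latest_json is hypothetical, and it assumes the prediction file is the newest *.json file anywhere under the given directory, since the exact file naming is not specified here.

```shell
# latest_json: hypothetical helper, not provided by this repo.
# Prints the most recently modified *.json file under a directory
# (default: logs). Prints nothing if no JSON file is found.
latest_json() {
  search_dir="${1:-logs}"
  # ls -t sorts the matched files newest-first; head keeps the first one.
  find "$search_dir" -type f -name '*.json' -exec ls -t {} + 2>/dev/null | head -n 1
}
```

Calling latest_json with no argument searches logs/; the printed path is the file to upload to the evaluation server, assuming the newest JSON file is the one produced by your run.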

Fine-tuning on Uni-Modal Tasks

# Base Model on CIFAR with VLP
bash scripts/ftfpt_base_cifar.sh

# Base Model on GLUE with VLP
bash scripts/ftfpt_base_glue.sh

Citation

@article{xu2022bridge,
  title={BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning},
  author={Xu, Xiao and Wu, Chenfei and Rosenman, Shachar and Lal, Vasudev and Che, Wanxiang and Duan, Nan},
  journal={arXiv preprint arXiv:2206.08657},
  year={2022}
}

Acknowledgement

We are grateful for the public code of the following papers, on which our code is partly based:

Main Code: METER, ViLT
Others: CLIP, ALBEF, BLIP
