K-LITE: Learning Transferable Visual Models with External Knowledge

This is the official Pytorch implementation of KLITE:

"K-LITE: Learning Transferable Visual Models with External Knowledge", NeurIPS 2022 (Oral, 1%) by

Sheng Shen*, Chunyuan Li*, Xiaowei Hu*, Yujia Xie, Jianwei Yang, Pengchuan Zhang, Zhe Gan, Lijuan Wang, Lu Yuan, Ce Liu, Kurt Keutzer, Trevor Darrell, Anna Rohrbach and Jianfeng Gao.

Update

Jan 17 2023. REACT extends KLITE from text knowledge from a dictionary to retrieved multimodal knoweldge from the web.

Introduction

In this paper, we propose K-LITE, a simple strategy to leverage external knowledge for building transferable visual systems: In training, it enriches entities in text with WordNet and Wiktionary knowledge, leading to an efficient and scalable approach to learning image representations that uses knowledge about the visual concepts. In evaluation, the text is also augmented with external knowledge and then used to reference learned visual concepts (or describe new ones) to enable zero-shot and few-shot transfer of the pre-trained models. We study the performance of K-LITE on two important computer vision problems, image classification (IC) and object detection (OD) in ELEVATER benchmark, on 20 and 13 different existing datasets, respectively. The proposed knowledge-augmented models show 6.29% average improvement on 20 IC tasks and 4.2% average improvement on 13 OD tasks in performance over existing methods.

We provide two illustrative examples why K-LITE could be helpful from Oxford-Flowers and Food-101 IC tasks.

Benchmarking

UniCL training with image-label data and image-text pairs

Model	Training Set	ZS on IN-1K	ZS on 20 datasets	Download
Swin-T	IN-21K	28.5	27.1	ckpt/config
Swin-T	IN-21K + GCC-15M	46.9	39.8	ckpt/config
Swin-T	IN-21K + GCC-15M + YFCC-14M	49.3	40.5	ckpt/config
Swin-B	IN-21K + GCC-15M	50.0	39.4	ckpt/config
Swin-B	IN-21K + GCC-15M + YFCC-14M	52.3	42.5	ckpt/config

K-LITE training with image-label data and image-text pairs augmented by knowledge data

Model	Training Set	ZS on IN-1K	ZD on IN-1k (+5 GPT-3 Knowledge)	ZS on 20 datasets	ZS on 20 datasets (+5 GPT-3 Knowledge)	Download
Swin-T	IN-21K	30.5	32.0	33.5	33.8	ckpt/config
Swin-T	IN-21K + GCC-15M	49.9	51.6	41.1	42.3	ckpt/config
Swin-T	IN-21K + GCC-15M + YFCC-14M	49.6	51.9	40.3	41.6	ckpt/config
Swin-B	IN-21K + GCC-15M	52.7	55.0	42.8	43.6	ckpt/config
Swin-B	IN-21K + GCC-15M + YFCC-14M	55.8	58.0	42.5	44.8	ckpt/config

NOTE: Setting "ZS on 20 datasets" is used in the ICinW benchmark. All the above models are trained without strong data augmentations like mixup and cutmix.

Getting Started

Setup

To setup the environment, please run

pip install -r requirements.txt
pip install -e .

Note that for run main.py for potential training and evaluation, you need to install apex.

Also, see klite/load_wiki for constructing image-text pairs or image-label data (train/validation) augmented by knowledge.

Data preparation

Please following DATA.md for data preparation.

Evaluation

ImageNet Evaluation

To evaluate a pre-trained K-LITE on ImageNet val, run:

python -m torch.distributed.launch --nproc_per_node <num-of-gpus-to-use> --master_port 12345 main.py --eval \
--cfg <config-file> --resume <checkpoint> --data-path <imagenet-path>  --use_knowledge

MODE: pretrain method (klite or unicl)
NGPUS: number of gpus
CFG: model config (configs/klite_swin_tiny.yaml or configs/klite_swin_base.yaml)
CKPT_DIR: directory to the ckeckpoint
IMAGENETPATH: path to ImageNet

bash scripts/run_in1k_eval.sh $MODE $NGPUS $CFG $CKPT_DIR $IMAGENETPATH

For example, to evaluate the KLITE-Swin-Tiny trained on IN-21K + GCC-15M with a single GPU:

python -m torch.distributed.launch --nproc_per_node 1 --master_port 12345 main.py --eval \
--cfg configs/klite_swin_tiny.yaml --resume ckpt/klite/in21k_gcc15m/tiny/model_state_dict.pt --data-path <imagenet-path>  --use_knowledge

20 ELEVATER Image Classification tasks Evaluation

For evaluating KLITE for downstream image classification tasks, and comparing performance on the same task suite, we include the evaluation toolkit here at klite/vision_bechmark/. Please run the setup before evalutaing on the 20 ELEVATER Image Classification tasks.

Then, to evaluate a pre-trained K-LITE on 20 ELEVATER Image Classification tasks in a zero-shot way, run:

MODE: pretrain method (klite or unicl)
CFG: model config (clip_swin_tiny or clip_swin_base)
CKPT_PATH: path to the checkpoint 
CKPT_ID: the dataset used to pretrain the model (in21k, in21k_gcc15m, in21k_gcc15m_yfcc14m)

bash scripts/run_elevater_eval.sh $MODE $CFG $CKPT_PATH $CKPT_ID

For example, to evaluate the KLITE-Swin-Tiny trained on IN-21K + GCC-15M with a single GPU:

CUDA_VISIBLE_DEVICES=0 bash  scripts/run_elevater_eval.sh klite clip_swin_tiny ckpt/klite/in21k_gcc15m_yfcc14m/tiny/model_state_dict.pt

More details for ELEVATER benchmark can be found: [Benchmark] [Toolkit] [Paper]

Citation

If you find this repo useful to your project, please consider to cite it with following bib:

@inproceedings{shen2022k,
    title={K-lite: Learning transferable visual models with external knowledge},
    author={Shen, Sheng and Li, Chunyuan and Hu, Xiaowei and Xie, Yujia and Yang, Jianwei and Zhang, Pengchuan and Rohrbach, Anna and Gan, Zhe and Wang, Lijuan and Yuan, Lu and others},
    booktitle={NeurIPS},
    year={2022}
}

Acknowledgement

Our codebase is built based on UniCL and ELEVATER.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

microsoft / klite