
$\mathcal{S}^2$ Transformer for Image Captioning

This repository contains the official implementation of the paper S2 Transformer for Image Captioning (IJCAI 2022).

Figure: overview of the Relationship-Sensitive Transformer.

Table of Contents

  • Environment setup
  • Data Preparation
  • Training
  • Evaluation
  • Reference and Citation
  • Acknowledgements

Environment setup

Clone this repository and create the m2release conda environment using the environment.yml file:

conda env create -f environment.yml
conda activate m2release

Then download the spaCy data by executing the following command:

python -m spacy download en_core_web_md

Note: Python 3 is required to run our code. If you run into network problems, download the en_core_web_md package from here, unzip it, and place it in /your/anaconda/path/envs/m2release/lib/python*/site-packages/
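
To verify the language model is installed correctly, a quick sanity check (the example sentence is illustrative only):

import spacy

# Should load without error once en_core_web_md is installed.
nlp = spacy.load("en_core_web_md")
doc = nlp("A man riding a wave on top of a surfboard.")
print([token.text for token in doc])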

Data Preparation

  • Annotation. Download the annotation file annotation.zip [1]. Extract it and put it in the project root directory.
  • Feature. Download the processed image features (ResNeXt-101 and ResNeXt-152) [2] and put them in the project root directory.
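
The feature files are HDF5 archives (see --features_path below). As a quick sanity check after downloading, you can inspect the file contents; this is a hedged sketch, since the exact file name and dataset keys depend on how the RSTNet features are packed:

import h5py

# Illustrative only: substitute the path of the downloaded feature archive.
with h5py.File("/path/to/features.hdf5", "r") as f:
    key = next(iter(f.keys()))
    print(key, f[key])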

Training

Run python train_transformer.py using the following arguments:

Argument             Description
--exp_name           Experiment name
--batch_size         Batch size (default: 50)
--workers            Number of data-loading workers (speeds up training in the XE stage)
--head               Number of attention heads (default: 8)
--resume_last        If used, training resumes from the last checkpoint.
--resume_best        If used, training resumes from the best checkpoint.
--features_path      Path to the visual features file (h5py)
--annotation_folder  Path to the annotation folder
--num_clusters       Number of pseudo regions
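
Here, --num_clusters sets the number of pseudo regions into which the grid features are grouped. As a rough, hypothetical illustration of that idea (the model learns the grouping end-to-end; this is not the repository's implementation):

import torch

def pseudo_regions(grid_feats, centers):
    # grid_feats: (N, D) flattened grid features; centers: (K, D) cluster centers.
    assign = (grid_feats @ centers.t()).softmax(dim=-1)  # (N, K) soft assignment
    # Weighted average of the grid features assigned to each cluster.
    regions = assign.t() @ grid_feats / assign.sum(dim=0, keepdim=True).t()
    return regions  # (K, D) pseudo-region features

feats = torch.randn(49, 2048)   # e.g. a 7x7 grid of ResNeXt features
centers = torch.randn(5, 2048)  # K = 5, matching --num_clusters 5
print(pseudo_regions(feats, centers).shape)  # torch.Size([5, 2048])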

For example, to train the model, run the following command:

python train_transformer.py --exp_name S2 --batch_size 50 --m 40 --head 8 --features_path /path/to/features --num_clusters 5

or just run:

bash train.sh

Note: We use torch.distributed to train our model; set worldSize in train_transformer.py to control the number of GPUs used for training.
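
A minimal sketch of the launch pattern this implies (illustrative only; the actual logic lives in train_transformer.py, where worldSize plays the role of world_size below, and the backend/address choices here are assumptions):

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, world_size):
    # One process per GPU; rank indexes the GPU this process drives.
    dist.init_process_group("nccl", init_method="tcp://127.0.0.1:29500",
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # ... build the model, wrap it in DistributedDataParallel, train ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2  # analogous to worldSize in train_transformer.py
    mp.spawn(run, args=(world_size,), nprocs=world_size)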

Evaluation

Offline Evaluation

Run python test_transformer.py to evaluate the model using the following arguments:

python test_transformer.py --batch_size 10 --features_path /path/to/features --model_path /path/to/saved_transformer_models/ckpt --num_clusters 5

Note: We removed the SPICE metric from evaluation during training because it is time-consuming. You can add it back when evaluating the model: download this file, put it in /path/to/evaluation/, then uncomment the corresponding code in __init__.py.
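
For reference, the metrics follow the standard COCO caption toolkit conventions. A minimal sketch of scoring in that style, using the pycocoevalcap package rather than the repo's own evaluation/ module (an assumption for illustration):

from pycocoevalcap.cider.cider import Cider

# Both dicts map an image id to a list of caption strings.
gts = {"img1": ["a man rides a wave on a surfboard"]}
res = {"img1": ["a man riding a wave on top of a surfboard"]}
score, _ = Cider().compute_score(gts, res)
print(f"CIDEr: {score:.3f}")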

We provide a pretrained model here; evaluating it should give the following results (second row):

Model                           B@1   B@4   M     R     C      S
Our Paper (ResNeXt-101)         81.1  39.6  29.6  59.1  133.5  23.2
Reproduced Model (ResNeXt-101)  81.2  39.9  29.6  59.1  133.7  23.3

(B@1/B@4: BLEU-1/BLEU-4; M: METEOR; R: ROUGE-L; C: CIDEr; S: SPICE)

Online Evaluation

We also report the performance of our model on the online COCO test server, using an ensemble of four S2 models. The detailed online test code can be obtained from this repo.

Reference and Citation

Reference

[1] Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, and Rita Cucchiara. Meshed-memory transformer for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
[2] Xuying Zhang, Xiaoshuai Sun, Yunpeng Luo, Jiayi Ji, Yiyi Zhou, Yongjian Wu, Feiyue Huang, and Rongrong Ji. RSTNet: Captioning with adaptive attention on visual and non-visual words. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15465–15474, 2021.

Citation

@inproceedings{S2,
  author    = {Pengpeng Zeng and
               Haonan Zhang and
               Jingkuan Song and 
               Lianli Gao},
  title     = {S2 Transformer for Image Captioning},
  booktitle = {IJCAI},
  % pages     = {????--????}
  year      = {2022}
}

Acknowledgements

Thanks to Zhang et al. for releasing the visual features (ResNeXt-101 and ResNeXt-152); our code implementation is also based on their repo.
Thanks also to M2 Transformer for the original annotations, and to grid-feats-vqa for the effective visual representations.


License

MIT License

