mesnico / TERAN

Code and Resources for the Transformer Encoder Reasoning and Alignment Network (TERAN), accepted for publication in ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)

Transformer Encoder Reasoning and Alignment Network (TERAN)

Updates

  • 🔥 09/2022: The extension to this work (ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval) has been published in the proceedings of CBMI 2022. Check out the code and paper!

Introduction

Code for the cross-modal visual-linguistic retrieval method from "Fine-grained Visual Textual Alignment for Cross-modal Retrieval using Transformer Encoders", accepted for publication in ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) [Pre-print PDF].

This work is an extension of our previous approach, TERN, accepted at ICPR 2020.

This repo is built on top of VSE++ and TERN.

Fine-grained Alignment for Precise Matching

Retrieval

Setup

  1. Clone the repo and move into it:
git clone https://github.com/mesnico/TERAN
cd TERAN
  2. Set up the Python environment using conda:
conda env create --file environment.yml
conda activate teran
export PYTHONPATH=.
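
As a quick sanity check of the two setup steps above, a minimal sketch like the following can confirm that the shell is configured as expected. It assumes that conda exposes the active environment name in the CONDA_DEFAULT_ENV variable; the helper name environment_problems is hypothetical and not part of this repo.

```python
import os


def environment_problems(environ=None):
    """Return a list of human-readable problems with the shell setup."""
    environ = os.environ if environ is None else environ
    problems = []
    # Assumption: conda stores the active environment name in CONDA_DEFAULT_ENV.
    if environ.get("CONDA_DEFAULT_ENV") != "teran":
        problems.append("conda environment 'teran' is not active")
    # The setup step exports PYTHONPATH=. so the repo root is importable.
    python_path = environ.get("PYTHONPATH", "").split(os.pathsep)
    if "." not in python_path:
        problems.append("PYTHONPATH does not contain '.'")
    return problems


if __name__ == "__main__":
    for problem in environment_problems():
        print("Setup issue:", problem)
```

An empty list means both checks passed; run it from the repo root after activating the environment.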

Get the data

Data and pretrained models can be downloaded from this OneDrive link (see the steps below to understand which files you need):

  1. Download the data archive, which contains the annotations, the splits by Karpathy et al., and the precomputed ROUGE-L and SPICE relevances for both the COCO and Flickr30K datasets. Extract it:
tar -xvf data.tgz
  2. Download the bottom-up features for both COCO and Flickr30K. We use the code by Anderson et al. for extracting them. The following commands extract them under data/coco/ and data/f30k/. If you prefer another location, be sure to adjust the configuration file accordingly.
# for MS-COCO
tar -xvf features_36_coco.tgz -C data/coco

# for Flickr30k
tar -xvf features_36_f30k.tgz -C data/f30k
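
After the extraction steps above, a small check such as the sketch below can confirm that the default folders are in place before evaluating or training. It only tests for the directories named in this README (data, data/coco, data/f30k); the files expected inside them are not listed here, so they are not checked. The helper name missing_dirs is hypothetical.

```python
from pathlib import Path

# Directories named in this README. Their exact contents are not specified
# here, so this sketch only checks that the folders themselves exist.
EXPECTED_DIRS = ("data", "data/coco", "data/f30k")


def missing_dirs(root=".", expected=EXPECTED_DIRS):
    """Return the expected directories that are absent under `root`."""
    root = Path(root)
    return [d for d in expected if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs()
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("All expected data directories are present.")
```

If you extracted the features to another location, pass your own list through the expected argument.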

Evaluate

Extract our pre-trained TERAN models:

tar -xvf TERAN_pretrained_models.tgz

Then, issue the following commands to evaluate a given model on the 1k (5-fold cross-validation) or 5k test sets.

python3 test.py pretrained_models/[model].pth --size 1k
python3 test.py pretrained_models/[model].pth --size 5k

Note that if you changed any of the default paths (e.g., the features are in a folder other than data/coco/features_36), you need to pass the --config option with the YAML configuration file containing the correct paths.
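
To evaluate on both test-set sizes in one go, the commands above can be assembled programmatically. The following is a minimal sketch that uses only the test.py arguments shown in this section (the checkpoint path, --size, and --config); the helper name build_test_command is hypothetical, and "[model]" is kept as the README's own placeholder.

```python
import shlex


def build_test_command(checkpoint, size, config=None):
    """Assemble the test.py invocation for one checkpoint and test-set size."""
    if size not in ("1k", "5k"):
        raise ValueError("size must be '1k' or '5k'")
    cmd = ["python3", "test.py", checkpoint, "--size", size]
    # Only needed when the default paths were changed.
    if config is not None:
        cmd += ["--config", config]
    return cmd


if __name__ == "__main__":
    # "[model]" is a placeholder; substitute a real checkpoint name.
    checkpoint = "pretrained_models/[model].pth"
    for size in ("1k", "5k"):
        print(shlex.join(build_test_command(checkpoint, size)))
```

The script only prints the commands. To run them instead, pass each list to subprocess.run(cmd, check=True) from the repo root.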

Train

To train the model using a given TERAN configuration, issue the following command:

python3 train.py --config configs/[config].yaml --logger_name runs/teran

runs/teran is where the output files (TensorBoard logs, checkpoints) are stored during the training session.
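
When launching several training sessions, it can help to give each one its own output folder. The sketch below builds the training command with an optional timestamp suffix on the --logger_name path. The suffix scheme and the helper name build_train_command are assumptions for illustration; how train.py handles an already existing folder is not stated here.

```python
from datetime import datetime


def build_train_command(config, logger_name="runs/teran", timestamp=None):
    """Assemble the train.py invocation, optionally suffixing the run folder."""
    if timestamp is not None:
        # Hypothetical naming scheme: runs/teran_YYYYMMDD-HHMMSS
        logger_name = f"{logger_name}_{timestamp:%Y%m%d-%H%M%S}"
    return ["python3", "train.py", "--config", config,
            "--logger_name", logger_name]


if __name__ == "__main__":
    # "[config]" is a placeholder; substitute a real configuration file.
    cmd = build_train_command("configs/[config].yaml", timestamp=datetime.now())
    print(" ".join(cmd))
```

Since the output folder holds TensorBoard logs, the runs can then be compared by pointing TensorBoard at the parent folder, e.g. tensorboard --logdir runs.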

Visualization

WIP

Reference

If you found this code useful, please cite the following paper:

@article{messina2021fine,
  title={Fine-grained visual textual alignment for cross-modal retrieval using transformer encoders},
  author={Messina, Nicola and Amato, Giuseppe and Esuli, Andrea and Falchi, Fabrizio and Gennaro, Claudio and Marchand-Maillet, St{\'e}phane},
  journal={ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)},
  volume={17},
  number={4},
  pages={1--23},
  year={2021},
  publisher={ACM New York, NY}
}

License

Apache License 2.0
