Captioning ImageNet

This project captions the images in the ImageNet dataset in a semi-automatic way. Specifically, we use a state-of-the-art caption generator combined with a constrained beam search algorithm to accomplish this task.

Motivation

This project is motivated by the fact that the datasets available for training image captioning models are largely limited to Microsoft COCO and Flickr. We therefore caption the images in ImageNet to extend the pool of available datasets.

Framework Used

The caption generator in this project is built upon Pythia, which is developed by Facebook AI Research and provides a pre-trained BUTD (Bottom-Up Top-Down) caption generator. Moreover, since Pythia relies on PyTorch, this project also requires PyTorch to be installed.

Features

The basic feature of this project is that it can generate captions for images in ImageNet using the method described in the project report. The most exciting part, however, is that it accepts almost any regular expression specifying the format of a caption and enforces it through constrained beam search. More precisely, this project can caption arbitrary images such that the generated captions follow constraints specified by a user-defined regular expression. For example, a regular expression like '.?(dog|cat).?' forces the generated caption to contain the word 'dog' or 'cat'.

Installation

In order to use this project, Pythia should first be installed:

git clone https://github.com/Songtuan-Lin/pythia.git
cd pythia/
git reset --hard 33225b89023472f9307b4e665e6429dbcbe01d77
sed -i '/torch/d' requirements.txt
pip install -e .
git clone https://gitlab.com/meetshah1995/vqa-maskrcnn-benchmark.git
cd vqa-maskrcnn-benchmark/
python setup.py build develop

If the installation fails, check whether all dependencies are installed:

pip install ninja yacs cython matplotlib demjson
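
As a quick sanity check, the following minimal Python snippet verifies that the main dependencies import cleanly (a sketch; it assumes the two repositories install under the module names pythia and maskrcnn_benchmark, which may differ in your environment):

import torch
import pythia               # installed from the pythia repository above
import maskrcnn_benchmark   # built from vqa-maskrcnn-benchmark above

print('torch', torch.__version__)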

Then, clone this repo and create a directory called model_data, which is used to hold the pre-trained model data:

git clone https://gitlab.cecs.anu.edu.au/u6162630/Captioning-ImageNet-Pythia.git
cd Captioning-ImageNet-Pythia/
mkdir model_data/

Finally, download pre-trained model data:

wget -O model_data/vocabulary_captioning_thresh5.txt https://dl.fbaipublicfiles.com/pythia/data/vocabulary_captioning_thresh5.txt
wget -O model_data/detectron_model.pth https://dl.fbaipublicfiles.com/pythia/detectron_model/detectron_model.pth
wget -O model_data/butd.pth https://dl.fbaipublicfiles.com/pythia/pretrained_models/coco_captions/butd.pth
wget -O model_data/butd.yaml https://dl.fbaipublicfiles.com/pythia/pretrained_models/coco_captions/butd.yml
wget -O model_data/detectron_model.yaml https://dl.fbaipublicfiles.com/pythia/detectron_model/detectron_model.yaml
wget -O model_data/detectron_weights.tar.gz https://dl.fbaipublicfiles.com/pythia/data/detectron_weights.tar.gz
tar xf model_data/detectron_weights.tar.gz
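
The following small check (a sketch; the file names are taken from the wget commands above) verifies that all model files landed in model_data:

import os

expected = [
    'vocabulary_captioning_thresh5.txt',
    'detectron_model.pth',
    'butd.pth',
    'butd.yaml',
    'detectron_model.yaml',
    'detectron_weights.tar.gz',
]
missing = [f for f in expected
           if not os.path.isfile(os.path.join('model_data', f))]
print('missing files:', missing if missing else 'none')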

Now, we are ready to go!

Usage

To caption the images in ImageNet, simply execute the caption_imagenet.py file with three command-line arguments: the root directory of the ImageNet dataset, the target directory to hold the caption results, and the number of upper levels of the ImageNet tag hierarchy to trace (as described in the project report):

python caption_imagenet.py --root_dir <ImageNet root directory> --save_dir <directory to save results> --up_level <levels to trace>
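
For example, with hypothetical paths and an illustrative tracing depth of 2:

python caption_imagenet.py --root_dir /data/ImageNet --save_dir ./captions --up_level 2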

To support regular-expression input, we also provide the following classes (see the sketch after this list):

  1. utils.finite_automata.FiniteAutomata: Constructs a finite automaton from an input regular expression.
  2. utils.table_tensor.TableTensor: Converts the transition table of a finite automaton into PyTorch tensors.
  3. dataset.customized_dataset.CustomizedDataset: Loads an arbitrary dataset together with transition tables represented as PyTorch tensors.
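
As a rough end-to-end sketch of how these classes fit together (hedged: the signatures follow the API Reference below, the regular expression is illustrative, and the vocabulary path and image directory are placeholders):

from utils.finite_automata import FiniteAutomata
from utils.table_tensor import TableTensor
from dataset.customized_dataset import CustomizedDataset

# Build an NFA from a user-defined regular expression.
nfa = FiniteAutomata('.?(dog|cat).?')

# Load the captioning vocabulary downloaded during installation.
with open('model_data/vocabulary_captioning_thresh5.txt') as f:
    vocab = f.read().split()

# Convert the NFA's transition table to PyTorch tensors.
transitions = TableTensor(vocab, nfa.transitions()).to_tensors()

# Wrap an image directory plus the transition tensors as a PyTorch dataset.
dataset = CustomizedDataset('path/to/images', transitions)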

Additionally, a regular expression should consist of the following symbols:

  1. .: Matches any single character.
  2. ?: Matches zero or more occurrences of the preceding element.
  3. (: Acts as a delimiter; does not match any symbol.
  4. ): The same as (.
  5. a-zA-Z: Alphabet characters.
  6. space: Used to separate tokens; does not match any symbol.

In particular, wildcard matching can be written as (.?). Moreover, we strongly suggest using '(' and ')' to separate each component of the regular expression. For example, instead of entering the regular expression 'dog|cat', we strongly recommend rewriting it as '(dog|cat)'.

An example demonstrating how to construct and use a regular expression is presented in the Demo section.

Demo

The following code snippet demonstrates how to construct a finite automaton from a regular expression and visualize it:

from utils.finite_automata import FiniteAutomata
from utils.table_tensor import TableTensor

reg = '.?(animal|bird).?'
nfa = FiniteAutomata(reg)
nfa.visualize()

The complete demo, which shows how to use our code to caption ImageNet and how to caption images under an arbitrary regular expression, can be found here:

Open In Colab

API Reference

class utils.finite_automata.FiniteAutomata(reg): This class takes an input regular expression and produces the corresponding NFA. The main class methods include (see the sketch after this list):

  1. transitions(): Returns the transition table corresponding to the input regular expression.
  2. visualize(): Visualizes the finite automaton.
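
A brief usage sketch of both methods (the regular expression is borrowed from the Demo section):

from utils.finite_automata import FiniteAutomata

nfa = FiniteAutomata('.?(animal|bird).?')
table = nfa.transitions()  # transition table of the NFA
nfa.visualize()            # draw the automaton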

class utils.table_tensor.TableTensor(vocab, table): This class takes two arguments: a pre-defined vocabulary and a transition table produced by the class FiniteAutomata. The main class methods include:

  1. to_tensors(): Converts the transition table to PyTorch tensors.

class dataset.customized_dataset.CustomizedDataset(root_dir, transitions): This class takes a file directory and a transition table produced by TableTensor as arguments and constructs a PyTorch dataset.
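
Since the result is a regular PyTorch dataset, it should compose with torch.utils.data.DataLoader in the usual way (a sketch; the image directory is a placeholder and the transition tensors are built as in the earlier sketch):

from torch.utils.data import DataLoader
from utils.finite_automata import FiniteAutomata
from utils.table_tensor import TableTensor
from dataset.customized_dataset import CustomizedDataset

with open('model_data/vocabulary_captioning_thresh5.txt') as f:
    vocab = f.read().split()
transitions = TableTensor(vocab, FiniteAutomata('(dog|cat)').transitions()).to_tensors()

dataset = CustomizedDataset('path/to/images', transitions)
loader = DataLoader(dataset, batch_size=8)  # iterate batches for captioning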

Implementation Note

The core of our implementation is how we represent the transition table of a finite automaton as PyTorch tensors. In our code, we represent the transition table as a list of tensors. This list contains k tensors, where k equals the number of states in the finite automaton. The ith tensor in the list indicates which tokens in the vocabulary can trigger a state transition from other states to state i. More precisely, if we denote the ith tensor in the list as Ti, then Ti has size (num_states, vocab_size) and the transition table is interpreted as follows:

  1. If Ti[j, k] = 0, then the kth token in the vocabulary triggers the state transition from state j to state i.
  2. Ti[j, k] = 1 otherwise, i.e., the kth token does not trigger a transition from state j to state i.

By representing the transition table as a list of tensors, we can then implement constrained beam search, as sketched below. This part of the code is well commented and hence will not be explained here.
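
As an illustrative sketch of this masking scheme (not the project's actual code; the sizes, the toy transition rule, and the helper mask_scores are made up for the example), disallowed tokens simply receive a log-probability of -inf during beam search:

import torch

num_states, vocab_size = 3, 5  # toy sizes

# One (num_states, vocab_size) tensor per target state i:
# T_i[j, k] == 0 means token k triggers a transition j -> i,
# T_i[j, k] == 1 means it does not (matching the convention above).
tables = [torch.ones(num_states, vocab_size) for _ in range(num_states)]
tables[1][0, 2] = 0  # toy rule: token 2 moves state 0 to state 1

def mask_scores(log_probs, state):
    # A hypothesis in state j may emit token k only if some target
    # state i has T_i[j, k] == 0; all other tokens are masked out.
    allowed = torch.zeros(vocab_size, dtype=torch.bool)
    for T in tables:
        allowed |= (T[state] == 0)
    return log_probs.masked_fill(~allowed, float('-inf'))

print(mask_scores(torch.zeros(vocab_size), state=0))
# only token 2 keeps a finite score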

Script to Download Captioned Data

Since our captioned images are stored in AWS S3, which can only be accessed with certain permissions, we provide a script to download the captioned data:
Open In Colab
