This model is a modified version of the bottom-up and top-down attention image captioning model, extended with semantic attention. To do this, we first implemented an attribute predictor based on canonical correlation analysis; it generates 5 attributes for a given image. Using the trained attribute predictor, we created attribute data for the MSCOCO train and validation images, then preprocessed this attribute data and used it to train the model. To utilize the attribute information in the captioning model, we added a top-down semantic attention LSTM, analogous to the top-down object attention LSTM. We also added semantic attention over the 5 attributes, whose output is used in the language model by concatenating it with the object attention layer. The quantitative results are as follows.
Our experiment environment is Colab, so only one GPU was available for training and evaluating the model. Therefore, the number of training epochs for our model and for the baseline we implemented is smaller than in the existing repository implementation. (You can download the checkpoints for our model and the baseline model we evaluated by clicking the model name in the table below.)

Model | BLEU-4 | METEOR | ROUGE-L | CIDEr |
---|---|---|---|---|
Original paper implementation | 36.2 | 27.0 | 56.4 | 113.5 |
Reference github implementation (36 epoch) | 35.9 | 26.9 | 56.2 | 111.5 |
Our implementation (10 epoch) | 35.4 | 26.3 | 55.3 | 108.3 |
Reference github implementation (10 epoch) | 34.7 | 26.1 | 54.3 | 107.3 |
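The semantic-attention design described above, where the language LSTM receives the object attention context concatenated with an attended context over the 5 attributes, can be sketched with NumPy. All dimensions, weight matrices, and variable names here are illustrative assumptions, not the repository's actual code:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, items, W_q, W_i, w):
    # Additive attention: score each item against the query, then
    # return the attention-weighted sum of the items.
    scores = np.tanh(items @ W_i + query @ W_q) @ w   # (n_items,)
    alpha = softmax(scores)                           # attention weights
    return alpha @ items                              # weighted context

rng = np.random.default_rng(0)
d_h, d_obj, d_att, d_score = 512, 2048, 300, 128
objects = rng.standard_normal((36, d_obj))   # 36 bottom-up object features
attrs   = rng.standard_normal((5, d_att))    # 5 predicted attribute embeddings
h_obj   = rng.standard_normal(d_h)           # object-attention LSTM hidden state
h_sem   = rng.standard_normal(d_h)           # semantic-attention LSTM hidden state

ctx_obj = attend(h_obj, objects,
                 rng.standard_normal((d_h, d_score)),
                 rng.standard_normal((d_obj, d_score)),
                 rng.standard_normal(d_score))
ctx_sem = attend(h_sem, attrs,
                 rng.standard_normal((d_h, d_score)),
                 rng.standard_normal((d_att, d_score)),
                 rng.standard_normal(d_score))

# The language LSTM input concatenates the two attended contexts.
language_input = np.concatenate([ctx_obj, ctx_sem])
```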
torch 0.4.1
h5py 2.8
tqdm 4.26
nltk 3.3
Create a folder called 'data'.
Create a folder called 'preprocessed_data'.
Download the MSCOCO Training (13GB) and Validation (6GB) images.
Also download Andrej Karpathy's training, validation, and test splits. This zip file contains the captions.
Unzip all files and place the folders in the 'data' folder.
Next, download the bottom up image features.
Unzip the folder and place the unzipped folder in the 'bottom_up_features' folder.
Next, type the following command in a Python 2 environment:
python bottom_up_features/tsv.py
This command will create the following files:
- An HDF5 file containing the bottom-up image features for the train and val splits, 36 per image, stored as an I×36×2048 tensor where I is the number of images in the split.
- PKL files that map training and validation image IDs to their index in the HDF5 dataset created above.
Move these files to the 'preprocessed_data' folder (see 'bottom_up_features/README.md' for more details).
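The PKL files above are plain image-ID-to-row-index mappings; a toy illustration of how such a mapping is stored and used (file name and layout are assumptions, not the script's exact output) might look like:

```python
import os
import pickle
import tempfile

# Hypothetical layout: {image_id: row_index}, where row_index selects
# the (36, 2048) feature block for that image in the HDF5 file.
train_ids = {9: 0, 25: 1, 30: 2}

path = os.path.join(tempfile.mkdtemp(), "train_ids.pkl")
with open(path, "wb") as f:
    pickle.dump(train_ids, f)

with open(path, "rb") as f:
    id_to_index = pickle.load(f)

# Features for image 25 would live at features_hdf5[row].
row = id_to_index[25]
```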
Next, you should create the attribute data before preprocessing the dataset. To do this, follow the instructions in the README file in the attribute_predictore folder.
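The attribute predictor is based on canonical correlation analysis: image features and attribute embeddings are projected into a shared space, and the 5 nearest attributes are retrieved for each image. A minimal NumPy sketch of this idea, with synthetic data and all names and dimensions hypothetical, could look like:

```python
import numpy as np

def fit_cca(X, Y, k, reg=1e-3):
    """Fit linear CCA; return projections A (for X) and B (for Y)."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = len(X)
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    Lx = np.linalg.inv(np.linalg.cholesky(Cxx))   # whitening for X
    Ly = np.linalg.inv(np.linalg.cholesky(Cyy))   # whitening for Y
    U, _, Vt = np.linalg.svd(Lx @ Cxy @ Ly.T)
    return Lx.T @ U[:, :k], Ly.T @ Vt[:k].T

rng = np.random.default_rng(0)
Z = rng.standard_normal((200, 5))                 # shared latent signal
X = Z @ rng.standard_normal((5, 64)) + 0.1 * rng.standard_normal((200, 64))
Y = Z @ rng.standard_normal((5, 32)) + 0.1 * rng.standard_normal((200, 32))
A, B = fit_cca(X, Y, k=5)

# Rank all attribute embeddings against one image in the shared space
# and keep the 5 most similar, as the predictor's 5 attributes.
img = (X[0] - X.mean(0)) @ A                      # projected image
attr_space = (Y - Y.mean(0)) @ B                  # projected attributes
sims = attr_space @ img / (np.linalg.norm(attr_space, axis=1)
                           * np.linalg.norm(img) + 1e-8)
top5 = np.argsort(-sims)[:5]                      # indices of 5 best attributes
```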
Next, run the notebook below to preprocess the data:
data_preprocessing.ipynb
or, type this command:
python create_input_files.py (*check your path)
This notebook (or command) creates JSON files containing the captions, caption lengths, attributes, bottom-up image features, and URL for each image; they are stored in the 'preprocessed_data' folder and used to train and evaluate the model.
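These preprocessed files are ordinary JSON, so they can be loaded with the standard library. The file names and contents below are toy stand-ins (the real preprocessing output may use different names and encodings):

```python
import json
import os
import tempfile

# Toy stand-ins for the preprocessed files.
base = tempfile.mkdtemp()
captions = [[2, 7, 7, 3]]        # one encoded caption: <start> w w <end>
caplens  = [4]                   # its length
attrs    = [[14, 3, 9, 21, 5]]   # 5 attribute indices per image

for name, obj in [("TRAIN_CAPTIONS.json", captions),
                  ("TRAIN_CAPLENS.json", caplens),
                  ("TRAIN_ATTRIBUTES.json", attrs)]:
    with open(os.path.join(base, name), "w") as f:
        json.dump(obj, f)

# A training loop would read them back like this:
with open(os.path.join(base, "TRAIN_ATTRIBUTES.json")) as f:
    loaded_attrs = json.load(f)
```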
Next, go to nlg_eval_master folder and type the following two commands:
pip install -e .
nlg-eval --setup
This will install all the files needed for evaluation.
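nlg-eval computes the BLEU-4, METEOR, ROUGE-L, and CIDEr scores reported in the table above. As a rough illustration of what these metrics measure, clipped unigram precision (the building block of BLEU-1, without the brevity penalty or higher-order n-grams that full BLEU adds) can be computed as:

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped unigram precision: matched candidate words / candidate length."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(n, ref[w]) for w, n in cand.items())
    return overlap / max(sum(cand.values()), 1)

# 4 of the 5 candidate tokens appear in the reference -> 0.8
p = unigram_precision("a man riding a horse", "a man rides a horse")
```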
To train the bottom-up top-down model from scratch, type:
python model_train.py <CHECKPOINT_PATH> (*If you have no checkpoint, write None)
or follow the notebook below:
model_training.ipynb (*To configure the checkpoint path, see models/model_parameters.py)
To evaluate the model on the Karpathy test split, edit the eval.py file to include the model checkpoint location and then type:
python model_evaluation.py
or follow the notebook below:
model_evaluation.ipynb <CHECKPOINT_PATH>
To generate a caption for a single test image, follow the notebook below:
caption_prediction.ipynb