gaelmoccand / CVND_Image-Captioning

This project contains a neural network architecture to automatically generate captions from images using LSTM and CNN

Image-Captioning-Project

In this project, a CNN-LSTM architecture is used to automatically generate captions from images (see the Show and Tell paper in the references).


After training the network on the Microsoft Common Objects in Context (MS COCO) dataset, it is used to generate captions for new images.

Project Structure

  1. model: contains the model architecture (CNN encoder and LSTM decoder).
  2. train: data pre-processing and training pipeline (a rough sketch of one training step follows this list).
  3. infer: generates captions on the test dataset using the trained model.
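As a rough illustration of how these pieces fit together, here is what one training step might look like in PyTorch. The EncoderCNN/DecoderRNN names, tensor shapes, and loss handling are assumptions made for this sketch, not taken from the repo's code.

```python
# Minimal sketch of one training step of the CNN-LSTM captioning model.
# Assumes (hypothetically) that the model module exposes EncoderCNN and
# DecoderRNN, and that captions arrive as tensors of token indices.
import torch
import torch.nn as nn

def train_step(encoder, decoder, images, captions, criterion, optimizer):
    """Run one forward/backward pass and return the scalar loss."""
    optimizer.zero_grad()
    features = encoder(images)              # (batch, embed_size)
    outputs = decoder(features, captions)   # (batch, seq_len, vocab_size)
    # Flatten batch and time dimensions for the token-level cross-entropy loss.
    loss = criterion(outputs.view(-1, outputs.size(-1)), captions.view(-1))
    loss.backward()
    optimizer.step()
    return loss.item()
```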

LSTMs

A very good summary of how LSTMs work can be found here: lstm
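As a quick refresher, the snippet below shows the bare PyTorch LSTM module that a caption decoder builds on; the batch size, sequence length, and layer sizes are arbitrary example values.

```python
# An LSTM consumes a sequence of embeddings and carries a hidden state and a
# cell state from step to step.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=256, hidden_size=512, num_layers=1, batch_first=True)

embeddings = torch.randn(4, 20, 256)        # (batch, seq_len, embed_size)
outputs, (h_n, c_n) = lstm(embeddings)      # outputs: (4, 20, 512)
print(outputs.shape, h_n.shape, c_n.shape)  # hidden/cell: (num_layers, 4, 512)
```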

Model Architecture

CNN Encoder using ResNet

A Convolutional Neural Network (CNN) is used for the encoder part. CNNs have been widely used and studied for image tasks and are currently state of the art for object recognition and detection. In this case, the ResNet architecture was chosen. Benchmarks for different CNN architectures can be found here.
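Below is a minimal sketch of such an encoder in PyTorch, assuming a pretrained torchvision ResNet-50 whose classification head is replaced by a trainable linear embedding layer. The exact ResNet variant and embedding size are assumptions for illustration, not read from this repo.

```python
# Sketch of a CNN encoder: a frozen pretrained ResNet backbone that maps an
# image to a fixed-size feature vector fed to the LSTM decoder.
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    def __init__(self, embed_size):
        super().__init__()
        resnet = models.resnet50(pretrained=True)
        for param in resnet.parameters():
            param.requires_grad_(False)          # freeze the convolutional backbone
        modules = list(resnet.children())[:-1]   # drop the final classification layer
        self.resnet = nn.Sequential(*modules)
        self.embed = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):
        features = self.resnet(images)                 # (batch, 2048, 1, 1)
        features = features.view(features.size(0), -1)
        return self.embed(features)                    # (batch, embed_size)
```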

Get Data

  1. Clone this repo: https://github.com/cocodataset/cocoapi
git clone https://github.com/cocodataset/cocoapi.git
  2. Set up the COCO API (also described in the readme here)
cd cocoapi/PythonAPI
make
cd ..
  3. Download some specific data from here: http://cocodataset.org/#download (described below; a quick check that the files are in place follows this list)
  • Under Annotations, download:

    • 2014 Train/Val annotations [241MB] (extract captions_train2014.json and captions_val2014.json, and place at locations cocoapi/annotations/captions_train2014.json and cocoapi/annotations/captions_val2014.json, respectively)
    • 2014 Testing Image info [1MB] (extract image_info_test2014.json and place at location cocoapi/annotations/image_info_test2014.json)
  • Under Images, download:

    • 2014 Train images [83K/13GB] (extract the train2014 folder and place at location cocoapi/images/train2014/)
    • 2014 Val images [41K/6GB] (extract the val2014 folder and place at location cocoapi/images/val2014/)
    • 2014 Test images [41K/6GB] (extract the test2014 folder and place at location cocoapi/images/test2014/)
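Once everything is in place, a short check like the one below can confirm that the annotations load through the pycocotools API built in step 2 (the paths assume the layout described above).

```python
# Sanity check: load the training captions through the COCO API.
from pycocotools.coco import COCO

coco = COCO('cocoapi/annotations/captions_train2014.json')
img_ids = coco.getImgIds()
print('training images with captions:', len(img_ids))

ann_ids = coco.getAnnIds(imgIds=img_ids[0])
print('example captions:', [ann['caption'] for ann in coco.loadAnns(ann_ids)])
```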

References

  1. Show and Tell: A Neural Image Caption Generator (google)
  2. How to LSTM (lstm)

License: MIT License
