
My work for the Udacity Computer Vision Nanodegree.


NDCV-Image-Captioning

In this project, we build a neural network architecture that automatically generates captions from images, using the MS COCO dataset for the captioning task.
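For reference, the dataset can be inspected with torchvision's built-in COCO wrapper. This is a hypothetical stand-in for the project's own data loader; the paths are placeholders for a local MS COCO download, and pycocotools must be installed:

```python
import torchvision.datasets as dset
import torchvision.transforms as T

# Placeholder paths; point these at a local MS COCO download.
coco = dset.CocoCaptions(
    root="images/train2014",
    annFile="annotations/captions_train2014.json",
    transform=T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()]),
)

image, captions = coco[0]   # one image tensor and its reference captions
print(image.shape)          # torch.Size([3, 224, 224])
print(captions[0])          # a human-written caption string
```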

Model Architecture

The model consists of two parts: an encoder and a decoder.

1. Encoder

The encoder uses a pre-trained ResNet-50 (with the final fully-connected layer removed) to extract features from a batch of pre-processed images. The output is flattened to a vector and passed through a linear layer that projects the feature vector to the same size as the word embedding.
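
A minimal PyTorch sketch of this encoder (class and parameter names such as `EncoderCNN` and `embed_size` are illustrative assumptions, not necessarily the notebook's exact code):

```python
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    def __init__(self, embed_size):
        super().__init__()
        resnet = models.resnet50(pretrained=True)
        # Freeze the pre-trained ResNet weights; only the projection is trained.
        for param in resnet.parameters():
            param.requires_grad_(False)
        # Keep everything except the final fully-connected classification layer.
        self.resnet = nn.Sequential(*list(resnet.children())[:-1])
        # Project the 2048-d pooled feature vector to the word-embedding size.
        self.embed = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):
        features = self.resnet(images)                   # (batch, 2048, 1, 1)
        features = features.view(features.size(0), -1)   # flatten to (batch, 2048)
        return self.embed(features)                      # (batch, embed_size)
```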

2. Decoder

The decoder has three layers: an embedding layer, an LSTM layer, and a linear layer. The caption tokens are passed through the embedding layer, while the image features from the encoder (already projected to the embedding size) serve as the LSTM input at the first time step. We train the LSTM with teacher forcing: at t = 1 the input is the encoder features, and at t = 2, 3, 4, and so on the input is the previous word from the ground-truth caption rather than the model's own prediction.
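
A sketch of such a decoder, under the same naming assumptions. The forward pass implements teacher forcing: `captions[:, :-1]` drops the last token so that inputs and targets stay aligned over the sequence:

```python
import torch
import torch.nn as nn

class DecoderRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Teacher forcing: embed the ground-truth caption (dropping the last
        # token) and use the image features as the input at the first step.
        embeddings = self.embed(captions[:, :-1])                   # (batch, T-1, embed)
        inputs = torch.cat((features.unsqueeze(1), embeddings), 1)  # (batch, T, embed)
        hiddens, _ = self.lstm(inputs)                              # (batch, T, hidden)
        return self.linear(hiddens)                                 # (batch, T, vocab)
```

At inference time there is no ground-truth caption, so the model instead feeds each predicted word back in as the next input.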

Inference Result

References

  1. Vinyals et al., "Show and Tell: A Neural Image Caption Generator" (arXiv:1411.4555)

  2. Xu et al., "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention" (arXiv:1502.03044)
