wongamanda / image-captioning

A deep learning model to generate captions for Flickr images

Image Captioning with Azure ML

A deep learning model to generate automatic descriptive captions for Flickr images

Architecture

Architecture diagram

The workflow for this project consists of Azure Blob Storage, Azure ML, and the Azure Computer Vision API.

[Figure: Cloud architecture for Image Captioning]

Neural network architecture

The figure shows the general architecture of the deep learning model, which is trained on images and their captions. We will use transfer learning and sequence models to generate captions.

[Figure: Neural network architecture]

Project Plan

Objectives:

  • Build a supervised deep learning model that can create alt-text captions for images
  • Train several models, select the one with the highest accuracy, and compare its captions against those generated by the Cognitive Services Computer Vision API

Output and success metrics:

  • Generate a short caption for an image randomly selected from the test dataset and compare it to the caption from the Computer Vision API output
  • A high accuracy rate and BLEU score for the predicted captions
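
As a rough illustration of the BLEU metric named above, here is a minimal sketch of unsmoothed sentence-level BLEU in plain Python. The function names (`ngrams`, `sentence_bleu`) and the uniform weighting over 1- to 4-grams are assumptions made for this example, not the project's evaluation code; a library implementation would likely be used in practice.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(references, candidate, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights.

    references: list of token lists (e.g. the five human captions)
    candidate:  token list produced by the model
    """
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty uses the reference length closest to the candidate length.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)


reference = ["a", "dog", "runs", "on", "grass"]
exact_score = sentence_bleu([reference], reference)
short_score = sentence_bleu([reference], reference[:4])
```

Without smoothing, any caption that shares no 4-gram with a reference scores zero, so short generated captions may need a smoothed variant or a lower `max_n`.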

About the data:

  • Flickr30k dataset (hosted on Kaggle) with roughly 30k images in JPEG format with over 158k captions. It has not been split into pre-defined training and test sets.
  • Each image has 5 different captions
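
Because each image has several captions and the dataset has no predefined split, one reasonable approach is to split at the image level so that all captions of an image land on the same side. The sketch below assumes the captions have already been read into `(image_id, caption)` pairs; the helper names, the 20% validation fraction, and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def group_captions(pairs):
    """Map each image ID to the list of its captions."""
    captions = defaultdict(list)
    for image_id, caption in pairs:
        captions[image_id].append(caption)
    return dict(captions)


def split_by_image(captions, val_fraction=0.2, seed=0):
    """Split at the image level so all captions of one image stay together."""
    image_ids = sorted(captions)
    random.Random(seed).shuffle(image_ids)  # seeded for a reproducible split
    n_val = max(1, int(len(image_ids) * val_fraction))
    val_ids = set(image_ids[:n_val])
    train = {i: c for i, c in captions.items() if i not in val_ids}
    val = {i: c for i, c in captions.items() if i in val_ids}
    return train, val


# Toy data standing in for the real captions file (10 images, 5 captions each).
pairs = [(f"img{i}.jpg", f"caption {j} for image {i}") for i in range(10) for j in range(5)]
captions = group_captions(pairs)
train, val = split_by_image(captions)
```

Splitting by caption row instead would leak images between the training and validation sets, which tends to inflate validation scores.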

Modeling techniques:

  • Transfer learning using Keras VGG16 or InceptionV3, and an RNN model (LSTM or GRU) to sequence over natural-language image captions
  • Pre-trained word vectors through GloVe
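
To show how pre-trained GloVe vectors might be wired into the caption vocabulary, here is a minimal sketch that parses the GloVe text format (one word per line, followed by its vector components) and builds an embedding matrix. The helper names, the toy three-dimensional vectors, and the `word_index` mapping are assumptions for illustration only; real GloVe files have 50 or more dimensions.

```python
import io


def load_glove(lines):
    """Parse GloVe text format: one word per line followed by its vector."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; unknown words stay zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]  # row 0 = padding
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None and len(vec) == dim:
            matrix[idx] = vec
    return matrix


# In-memory stand-in for a GloVe file; "frisbee" is deliberately missing.
toy_glove = io.StringIO("dog 0.1 0.2 0.3\ngrass 0.4 0.5 0.6\n")
word_index = {"dog": 1, "frisbee": 2, "grass": 3}
matrix = build_embedding_matrix(word_index, load_glove(toy_glove), dim=3)
```

A matrix like this could then serve as the initial weights of the decoder's embedding layer, optionally frozen during training.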

Execution stages:

  1. Prepare data
    1. Download and store data in blob storage
    2. Clean captions data
    3. Build a list of images and corresponding captions (i.e., image-input and text-output)
    4. Split data into training and validation sets
  2. Create vocabulary from the training dataset
    1. Preprocess captions
    2. Get unique words from all image captions
    3. Load pretrained word embeddings (GloVe)
    4. Tokenize captions into TensorFlow records (insert end-of-sentence tokens, etc.)
  3. Use images to train a model
    1. Get data from blob storage
    2. Extract features from photos using VGG model
    3. Pass the image feature vectors through the RNN decoder
  4. Predict captions using trained model
  5. Test model on validation data and measure accuracy
  6. (If time permits) Compare predicted image captions from the model to captions created by the Cognitive Services API
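
The caption preprocessing and vocabulary steps in stages 1 and 2 can be sketched in plain Python as follows. The special token names, the cleaning rule (lowercase, letters only), and the fixed `max_len` padding are assumptions for this example; the sketch produces plain integer lists rather than TensorFlow records.

```python
import re
from collections import Counter

START, END, UNK, PAD = "<start>", "<end>", "<unk>", "<pad>"


def clean_caption(text):
    """Lowercase, drop non-letter characters, and split into words."""
    return re.sub(r"[^a-z ]+", " ", text.lower()).split()


def build_vocab(captions, min_count=1):
    """Assign an integer ID to each special token and each frequent-enough word."""
    counts = Counter(word for cap in captions for word in clean_caption(cap))
    words = sorted(w for w, c in counts.items() if c >= min_count)
    vocab = [PAD, START, END, UNK] + words
    return {word: idx for idx, word in enumerate(vocab)}


def encode(caption, word_index, max_len):
    """Wrap a caption in start/end tokens, map words to IDs, and pad to max_len."""
    words = clean_caption(caption)[:max_len - 2]  # leave room for start/end
    tokens = [START] + words + [END]
    ids = [word_index.get(t, word_index[UNK]) for t in tokens]
    return ids + [word_index[PAD]] * (max_len - len(ids))


train_captions = ["A dog runs on the grass.", "Two dogs play with a ball!"]
word_index = build_vocab(train_captions)
encoded = encode("A dog plays.", word_index, max_len=6)
```

Building the vocabulary from the training captions only means that unseen validation words, such as "plays" here, map to the unknown token.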

Software Frameworks:

  • Python
  • Keras, TensorFlow
  • scikit-learn and matplotlib
  • GloVe
  • Pandas

License:Apache License 2.0


Languages

Language:Jupyter Notebook 100.0%