atharvapathak / Eye_For_Blind_Project

In this capstone project, we build a deep learning model that describes the contents of an image as speech, by generating a caption with an attention mechanism on the Flickr8K dataset.

Eye-For-Blind

Problem statement:

This problem statement combines deep learning and NLP. Such a model can help blind people understand any image through speech.

The features of an image are extracted by a CNN-based encoder and decoded into a caption by an RNN decoder with attention. The caption generated by this CNN-RNN model is then converted to speech using a text-to-speech library.
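
A minimal sketch of the final text-to-speech step, assuming the gTTS library (the exact TTS library is not named above); the caption string is a hypothetical example.

```python
from gtts import gTTS

def caption_to_speech(caption: str, out_path: str = "caption.mp3") -> str:
    """Convert a generated caption into an MP3 file and return its path."""
    tts = gTTS(text=caption, lang="en")   # synthesize English speech from the caption text
    tts.save(out_path)                    # write the audio to disk
    return out_path

# Hypothetical usage with an example caption:
caption_to_speech("a dog is running through the grass")
```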

The project is an extended application of the paper "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention".

Dataset:

The dataset is taken from Kaggle and consists of 8091 images, each paired with five different captions that provide clear descriptions of the salient entities and events in the image. The link is: https://www.kaggle.com/adityajn105/flickr8k
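
As a sketch of how the annotations can be loaded, assuming the Kaggle download contains a captions.txt file with two CSV columns, image and caption, and five caption rows per image (this file layout is an assumption about the download, not stated above).

```python
import pandas as pd

# Assumed layout: captions.txt is a CSV with columns "image" and "caption".
captions_df = pd.read_csv("flickr8k/captions.txt")

# Group the five captions that belong to each image.
captions_per_image = captions_df.groupby("image")["caption"].apply(list)

print(len(captions_per_image))      # expected: 8091 images
print(captions_per_image.iloc[0])   # the five captions of the first image
```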

Steps followed:

  1. Data Understanding: load the data and understand its representation.
  2. Data Preprocessing: preprocess both images and captions into the desired format.
  3. Train-Test Split: combine the images and captions to create the train and test datasets.
  4. Model Building: create the image captioning model by building the Encoder, Attention and Decoder components (a sketch follows this list).
  5. Model Evaluation: evaluate the model using greedy search and the BLEU score (see the BLEU sketch below).
  6. Model Testing: test the final model on sample images, generating a caption and converting it to speech.
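
The following is a minimal sketch of the Encoder-Attention-Decoder of step 4, assuming TensorFlow/Keras and pre-extracted CNN feature maps of shape (64, 2048) per image (e.g. from InceptionV3); the layer sizes and class names are illustrative, not the project's actual configuration.

```python
import tensorflow as tf

class CNNEncoder(tf.keras.Model):
    """Projects pre-extracted CNN feature vectors into the embedding space."""
    def __init__(self, embedding_dim):
        super().__init__()
        self.fc = tf.keras.layers.Dense(embedding_dim)

    def call(self, x):                         # x: (batch, 64, 2048)
        return tf.nn.relu(self.fc(x))          # (batch, 64, embedding_dim)

class BahdanauAttention(tf.keras.Model):
    """Additive (Bahdanau) attention over the 64 spatial feature vectors."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):          # features: (batch, 64, embed), hidden: (batch, units)
        hidden_with_time = tf.expand_dims(hidden, 1)
        scores = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time)))
        weights = tf.nn.softmax(scores, axis=1)               # attention weights over 64 regions
        context = tf.reduce_sum(weights * features, axis=1)   # weighted sum of features
        return context, weights

class RNNDecoder(tf.keras.Model):
    """Predicts the next caption word from the previous word and the attended image context."""
    def __init__(self, embedding_dim, units, vocab_size):
        super().__init__()
        self.units = units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units, return_sequences=True, return_state=True)
        self.fc1 = tf.keras.layers.Dense(units)
        self.fc2 = tf.keras.layers.Dense(vocab_size)
        self.attention = BahdanauAttention(units)

    def call(self, x, features, hidden):       # x: (batch, 1) previous word ids
        context, weights = self.attention(features, hidden)
        x = self.embedding(x)                                    # (batch, 1, embedding_dim)
        x = tf.concat([tf.expand_dims(context, 1), x], axis=-1)  # prepend attended context
        output, state = self.gru(x)
        x = tf.reshape(self.fc1(output), (-1, self.units))
        return self.fc2(x), state, weights                       # logits over the vocabulary

    def reset_state(self, batch_size):
        return tf.zeros((batch_size, self.units))
```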

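And a minimal sketch of the BLEU part of step 5, assuming NLTK; the reference and candidate captions below are hypothetical tokenized examples, not outputs of the model.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# One or more tokenized reference captions and one greedy-decoded candidate.
references = [["a", "dog", "runs", "through", "the", "grass"]]
candidate = ["a", "dog", "is", "running", "in", "the", "grass"]

# Smoothed BLEU-2 (unigram + bigram) for a single sentence pair.
smooth = SmoothingFunction().method1
score = sentence_bleu(references, candidate,
                      weights=(0.5, 0.5, 0.0, 0.0),
                      smoothing_function=smooth)
print(f"BLEU-2: {score:.3f}")
```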