This project uses a modified pre-trained ResNet50 model to extract features from an input image, and a text generator built on GloVe word embeddings to map those features to English words, producing a caption.
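As a rough illustration of the GloVe side, the sketch below parses vectors in the standard GloVe text format (one word per line, followed by its space-separated values) and builds an embedding matrix indexed by a tokenizer vocabulary. The function names, the three-dimensional sample vectors, and the `word_index` vocabulary are all hypothetical and are not taken from the project code.

```python
import io


def load_glove(lines):
    """Parse GloVe text-format lines ("word v1 v2 ... vd") into a dict."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector for the word with index i (row 0 is padding).

    Words that are missing from GloVe keep an all-zero row.
    """
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None and len(vec) == dim:
            matrix[idx] = vec
    return matrix


# Tiny made-up sample standing in for a real GloVe file (assumption).
sample = io.StringIO("dog 0.1 0.2 0.3\nruns 0.4 0.5 0.6\n")
glove = load_glove(sample)

# Hypothetical vocabulary, as a caption tokenizer might produce it.
word_index = {"dog": 1, "runs": 2, "frisbee": 3}
embedding_matrix = build_embedding_matrix(word_index, glove, dim=3)
```

A matrix like this is typically used to initialise a frozen embedding layer in the caption decoder; whether the project freezes or fine-tunes the embeddings is not stated here.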
The project draws on concepts from image processing, word embeddings, and GAN modelling, and uses the BLEU score as its evaluation metric.
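To make the metric concrete, here is a minimal sentence-level BLEU sketch in plain Python: clipped n-gram precision up to 4-grams, uniform weights, a brevity penalty, and no smoothing. It is an illustration only; the project may well compute BLEU with a library and different settings, and the example captions are made up.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights and no smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0  # candidate is shorter than n tokens
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        if clipped == 0:
            return 0.0  # without smoothing, one zero precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty, using the reference length closest to the candidate's.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)


# Made-up captions for illustration.
reference = "a dog runs on grass".split()
perfect = bleu("a dog runs on grass".split(), [reference])
partial = bleu("a dog runs on the grass".split(), [reference])
unrelated = bleu("two children play football outside".split(), [reference])
```

Because no smoothing is applied, short captions that share no 4-gram with any reference score exactly zero, which is one reason captioning work often reports BLEU-1 through BLEU-4 separately.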
The model was trained on the Flickr8k dataset and implemented in Keras with a TensorFlow backend. ResNet50 serves as the feature extractor for the images.
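The feature-extraction step could look roughly like the sketch below. It assumes that the "modified" ResNet50 simply drops the ImageNet classification head and average-pools the final feature map into one 2048-dimensional vector per image; the actual modification used in the project may differ. The helper names `build_feature_extractor` and `extract_features` are hypothetical.

```python
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input


def build_feature_extractor(weights="imagenet"):
    """Return ResNet50 without its classifier head (assumed modification).

    include_top=False removes the final dense layer, and pooling="avg"
    reduces the last feature map to a single 2048-d vector per image.
    """
    return ResNet50(weights=weights, include_top=False, pooling="avg")


def extract_features(model, images):
    """Encode a batch of images.

    images: array of shape (batch, 224, 224, 3) with RGB values in [0, 255].
    Returns an array of shape (batch, 2048).
    """
    # Copy to float32 first so preprocessing does not modify the caller's array.
    batch = preprocess_input(np.array(images, dtype="float32"))
    return model.predict(batch, verbose=0)
```

Calling `build_feature_extractor()` with the default argument downloads the ImageNet weights on first use. The resulting vectors are usually computed once per image and cached, so the caption decoder can train without re-running the CNN.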
Kaggle: https://www.kaggle.com/rishabhchaurasia7/image-captioning-on-flickr8k-dataset