image-captioning object-detection yolov3 encoder-decoder-model gru

Image-Captioning-via-YOLOv5-EncoderDecoderwithAttention

Archived: This code was just a fun project. It was neither properly tuned nor is it properly maintained, and no updates or corrections are expected. I am focusing on other areas and am not a vision expert. The code runs properly for most people; if you hit an error, please check that the images are being populated properly on your machine.

Use the original Flickr8K dataset: https://www.kaggle.com/datasets/adityajn105/flickr8k

Put the 'Images' directory and 'captions.txt' in the root of this repo.
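A quick way to confirm the layout before running anything is to load the captions and see whether every referenced image exists on disk. The snippet below is a minimal sketch, not part of this repo: `load_captions` is a hypothetical helper, and it assumes that 'captions.txt' is a CSV file with an `image,caption` header row, as in the Kaggle download.

```python
import csv
from collections import defaultdict
from pathlib import Path


def load_captions(root="."):
    """Group captions by image file name and list images missing on disk.

    Assumption: captions.txt is a CSV with an `image,caption` header row.
    """
    root = Path(root)
    images_dir = root / "Images"
    captions_file = root / "captions.txt"
    if not images_dir.is_dir() or not captions_file.is_file():
        raise FileNotFoundError(
            "expected 'Images/' and 'captions.txt' in " + str(root)
        )

    captions = defaultdict(list)
    with captions_file.open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            captions[row["image"]].append(row["caption"].strip())

    # Images named in captions.txt that are not present in Images/.
    missing = sorted(
        name for name in captions if not (images_dir / name).is_file()
    )
    return dict(captions), missing
```

Calling `captions, missing = load_captions(".")` from the repo root and checking that `missing` is empty is one way to verify that the images are in place.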

An attempt at image captioning that combines object detection via YOLOv5 with an encoder-decoder LSTM model, trained on the Flickr8K dataset.

Run the following to create object crops via YOLOv5:

python detect_object.py

Run the following to train. Training uses only the ResNet embeddings of the cropped images of detected objects, not any text from the YOLO labeller.

python train.py True

To evaluate on validation data:

python train.py False

Sample predictions -

references -  [This is a black dog splashing in the water, A black lab with tags frolicks in the water ,A black dog running in the surf,The black dog runs through the water]

prediction- [['<SOS>'], ['a'], ['black'], ['dog'], ['is'], ['a'], ['a'], ['water'], ['.'], ['<EOS>']]

references -  [A black dog and a spotted dog are fighting, A black dog and a tri-colored dog playing with each other on the road, A black dog and a white dog with brown spots are staring at each other in the street,Two dogs of different breeds looking at each other on the road]

prediction- [['<SOS>'], ['a'], ['black'], ['and'], ['white'], ['dog'], ['is'], ['running'], ['through'], ['a'], ['.'], ['<EOS>']]
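To put a number on how close a prediction like the ones above is to its references, a sentence-level BLEU-4 score can be computed. The snippet below is a minimal, unsmoothed sketch and not necessarily how this repo scores its output. `flatten_prediction` and `sentence_bleu4` are hypothetical helper names, and the references are assumed to be already split into tokens.

```python
import math
from collections import Counter


def flatten_prediction(prediction, specials=("<SOS>", "<EOS>")):
    """Turn [['<SOS>'], ['a'], ...] into ['a', ...], dropping special tokens."""
    return [tok for step in prediction for tok in step if tok not in specials]


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu4(references, hypothesis):
    """Unsmoothed BLEU-4 of one tokenized hypothesis against tokenized references.

    Returns 0.0 as soon as any n-gram order (1 to 4) has no match.
    """
    if not hypothesis or not references:
        return 0.0
    log_precisions = []
    for n in range(1, 5):
        hyp_counts = ngram_counts(hypothesis, n)
        # For each n-gram, the largest count seen in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty against the reference length closest to the hypothesis.
    hyp_len = len(hypothesis)
    ref_len = min(
        (len(ref) for ref in references),
        key=lambda length: (abs(length - hyp_len), length),
    )
    penalty = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return penalty * math.exp(sum(log_precisions) / 4)
```

Averaging `sentence_bleu4` over all validation images gives one possible definition of a mean BLEU-4. Because the score is unsmoothed, short captions with no matching 4-gram score zero, which pulls the mean down.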

The mean BLEU-4 score on validation data is quite low. Suggested improvements: use Adam and shuffle the data, and possibly use minibatching. Increase the number of datapoints by combining other datasets, since 8k is quite a small sample size (number of parameters >> number of datapoints, which is not ideal for neural nets).
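The shuffling and minibatching suggested here could look roughly like the sketch below. It is framework-agnostic and is not taken from `train.py`. `shuffled_minibatches` is a hypothetical helper, and `samples` stands for whatever list of (image embedding, caption) pairs the training loop iterates over.

```python
import random


def shuffled_minibatches(samples, batch_size, seed=None):
    """Yield lists of up to `batch_size` samples in a fresh random order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    rng = random.Random(seed)
    order = list(range(len(samples)))
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        # The last batch may be smaller than batch_size.
        yield [samples[i] for i in order[start:start + batch_size]]
```

Calling the generator once per epoch with a different seed (for example `seed=epoch`) gives a new ordering each time while keeping runs reproducible.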

Citation (Flickr8K Dataset)

Hodosh, Micah, Peter Young, and Julia Hockenmaier. "Framing image description as a ranking task: Data, models and evaluation metrics." Journal of Artificial Intelligence Research 47 (2013): 853-899.

GNU General Public License v3.0
