Image Caption and Summary Generation

The project aims at generating summary or caption for a given image.

There are 2 important parts of the project. The first part is the image model where the model architecture outputs the image features as non-linear activations of pixel values of the image.
These values are fed into the second part i.e. the language model which generates the summary sentence based on the output of the image model.

The dataset used for the project is MSCOCO dataset. You can download the trianing dataset by clicking here (Please note that the dataset is around 19.3 GB).
You also need the captions for them and they can be downloaded by clicking here.
Both of them are downloaded from the official cocodataset website.

You need to extract them and keep them ready before starting.

We have also used Python API for using coco dataset. Please install by following the procedures mentioned for PythonAPI from here

Progress

To save the training time the activations from image model is generated pre hand because we are using trained model as an image model and not meddling with the weights of the model.

So after cloning the project run -