
Watch & Tell

Show and Tell, but on video

This project has been completely reworked into an improved version; this repository is now obsolete.

(demo GIF: model predictions)

Install pycocotools (see the quick Jupyter mini guide on how to use the COCO Python API):

pip install pycocotools-windows
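
For reference, a minimal sketch of reading COCO captions through the pycocotools Python API (the annotation path is an assumption and depends on where the make script unpacks the dataset):

from pycocotools.coco import COCO

# Assumed location of the 2017 caption annotations; adjust to where
# the make script actually places the dataset.
coco = COCO("data/annotations/captions_train2017.json")

# Pick an arbitrary image and print every caption annotated for it.
img_id = coco.getImgIds()[0]
ann_ids = coco.getAnnIds(imgIds=[img_id])
for ann in coco.loadAnns(ann_ids):
    print(ann["caption"])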

Run the make script to download the COCO dataset (2017 challenge, ~50 GB; requires GNU wget)

Download the YOLO weights (the class names and config file are included here, but the weights themselves are too big for the repo):

cd YOLO
wget https://pjreddie.com/media/files/yolov3.weights

Train:

python train.py

Run:

python run.py

Based on the original architecture (and repo), this project uses ResNet-152 as the encoder and an LSTM as the decoder

(architecture diagrams: CNN Encoder and RNN Decoder)
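
As a rough sketch of that pairing (class names and sizes are illustrative, not taken from this repo's code), the ResNet-152 backbone is truncated to a feature extractor whose output seeds an LSTM that predicts the caption one word at a time:

import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """ResNet-152 feature extractor; the final fc layer is replaced by a linear projection."""

    def __init__(self, embed_size: int):
        super().__init__()
        resnet = models.resnet152(weights=models.ResNet152_Weights.DEFAULT)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier head
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():  # keep the CNN frozen; train only the projection
            features = self.backbone(images).flatten(1)
        return self.fc(features)

class DecoderRNN(nn.Module):
    """LSTM that conditions on the image feature and predicts the next word."""

    def __init__(self, embed_size: int, hidden_size: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # Prepend the image feature as the first "token" of the sequence.
        embeddings = torch.cat([features.unsqueeze(1), self.embed(captions)], dim=1)
        hiddens, _ = self.lstm(embeddings)
        return self.fc(hiddens)  # logits over the vocabulary at every step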

Using Darknet's YOLO to constrain where the model should look
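
A minimal sketch of how such a constraint could be wired up with OpenCV's DNN module (the config/weights paths, input size, and confidence threshold are assumptions, not necessarily what run.py does); the detected boxes can then be used to crop a frame before it reaches the encoder:

import cv2
import numpy as np

# Load the Darknet YOLOv3 model from the config and weights in the YOLO folder.
net = cv2.dnn.readNetFromDarknet("YOLO/yolov3.cfg", "YOLO/yolov3.weights")
layer_names = net.getUnconnectedOutLayersNames()

def detect_regions(frame: np.ndarray, conf_threshold: float = 0.5):
    """Return bounding boxes of confident detections to crop before captioning."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    boxes = []
    for output in net.forward(layer_names):
        for det in output:
            scores = det[5:]  # per-class scores follow the box and objectness entries
            if scores.max() > conf_threshold:
                cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
                boxes.append((int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)))
    return boxes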

@misc{https://doi.org/10.48550/arxiv.1411.4555,
  doi = {10.48550/ARXIV.1411.4555},
  url = {https://arxiv.org/abs/1411.4555},
  author = {Vinyals, Oriol and Toshev, Alexander and Bengio, Samy and Erhan, Dumitru},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences},
  title = {Show and Tell: A Neural Image Caption Generator},
  publisher = {arXiv},
  year = {2014},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

To be improved: ✔️ (Visit the new repo)

  • Migrate to OpenCV GPU build
  • Add an attention mechanism to the Decoder
  • Optimize model parameter size for inference speed
  • Change the greedy nearest-word search to a beam search over the vocabulary (see the sketch after this list)
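
A minimal sketch of what that beam search could look like for the decoder (the step callable and all names are hypothetical stand-ins for the decoder's single-step forward, not code from this repo):

def beam_search(step, start_token: int, end_token: int, beam_size: int = 3, max_len: int = 20):
    """Keep the `beam_size` most probable partial captions instead of the single greedy one.

    `step(tokens)` is assumed to return a 1-D tensor of log-probabilities over the
    vocabulary for the next word, given the tokens generated so far.
    """
    beams = [([start_token], 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == end_token:
                candidates.append((tokens, score))  # finished captions are carried over
                continue
            log_probs = step(tokens)
            top_lp, top_ix = log_probs.topk(beam_size)
            for lp, ix in zip(top_lp.tolist(), top_ix.tolist()):
                candidates.append((tokens + [ix], score + lp))
        # Keep only the best `beam_size` hypotheses for the next step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return max(beams, key=lambda c: c[1])[0]

Greedy decoding is the special case beam_size = 1; widening the beam trades inference time for caption likelihood.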

About

PyTorch implementation of Show and Tell, adapted for video


Languages

Python 86.5% · Cython 12.2% · Shell 1.2% · Makefile 0.2%