
Watch & Tell

Show and Tell, but on video

This project has been completely reworked into an improved version; this repository is now obsolete.

(demo GIF: model predictions)

Install pycocotools (see the quick Jupyter mini guide on how to use the COCO Python API):

pip install pycocotools-windows
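
For reference, a minimal sketch of reading COCO captions through the pycocotools Python API (the annotation path is an assumption and depends on where the make script unpacks the dataset):

from pycocotools.coco import COCO

# Assumed location of the 2017 caption annotations; adjust to where
# the make script actually places the dataset.
coco = COCO("data/annotations/captions_train2017.json")

# Pick an arbitrary image and print every caption annotated for it.
img_id = coco.getImgIds()[0]
ann_ids = coco.getAnnIds(imgIds=[img_id])
for ann in coco.loadAnns(ann_ids):
    print(ann["caption"])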

Run the make script to download the COCO dataset (2017 challenge, ~50 GB; requires GNU wget)

Download the YOLO weights (the class names and config file are included here, but the weights themselves are too big for the repo):

cd YOLO
wget https://pjreddie.com/media/files/yolov3.weights

Train:

python train.py

Run:

python run.py

Based on the original architecture (and repo), this project uses ResNet-152 as the encoder and an LSTM as the decoder

(architecture diagrams: CNN Encoder and RNN Decoder)
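
As a rough sketch of that pairing (class names and sizes are illustrative, not taken from this repo's code), the ResNet-152 backbone is truncated to a feature extractor whose output seeds an LSTM that predicts the caption one word at a time:

import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """ResNet-152 feature extractor; the final fc layer is replaced by a linear projection."""

    def __init__(self, embed_size: int):
        super().__init__()
        resnet = models.resnet152(weights=models.ResNet152_Weights.DEFAULT)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier head
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():  # keep the CNN frozen; train only the projection
            features = self.backbone(images).flatten(1)
        return self.fc(features)

class DecoderRNN(nn.Module):
    """LSTM that conditions on the image feature and predicts the next word."""

    def __init__(self, embed_size: int, hidden_size: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # Prepend the image feature as the first "token" of the sequence.
        embeddings = torch.cat([features.unsqueeze(1), self.embed(captions)], dim=1)
        hiddens, _ = self.lstm(embeddings)
        return self.fc(hiddens)  # logits over the vocabulary at every step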

Using Darknet's YOLO to constrain where the model should look
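
A minimal sketch of how such a constraint could be wired up with OpenCV's DNN module (the config/weights paths, input size, and confidence threshold are assumptions, not necessarily what run.py does); the detected boxes can then be used to crop a frame before it reaches the encoder:

import cv2
import numpy as np

# Load the Darknet YOLOv3 model from the config and weights in the YOLO folder.
net = cv2.dnn.readNetFromDarknet("YOLO/yolov3.cfg", "YOLO/yolov3.weights")
layer_names = net.getUnconnectedOutLayersNames()

def detect_regions(frame: np.ndarray, conf_threshold: float = 0.5):
    """Return bounding boxes of confident detections to crop before captioning."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    boxes = []
    for output in net.forward(layer_names):
        for det in output:
            scores = det[5:]  # per-class scores follow the box and objectness entries
            if scores.max() > conf_threshold:
                cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
                boxes.append((int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)))
    return boxes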

@misc{https://doi.org/10.48550/arxiv.1411.4555,
  doi = {10.48550/ARXIV.1411.4555},
  url = {https://arxiv.org/abs/1411.4555},
  author = {Vinyals, Oriol and Toshev, Alexander and Bengio, Samy and Erhan, Dumitru},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences},
  title = {Show and Tell: A Neural Image Caption Generator},
  publisher = {arXiv},
  year = {2014},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

To be improved: ✔️ (Visit the new repo)

  • Migrate to OpenCV GPU build
  • Add an attention mechanism to the Decoder
  • Optimize model parameter size for inference speed
  • Change the greedy nearest-word search to a beam search over the vocabulary (see the sketch after this list)
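
A minimal sketch of what that beam search could look like for the decoder (the step callable and all names are hypothetical stand-ins for the decoder's single-step forward, not code from this repo):

def beam_search(step, start_token: int, end_token: int, beam_size: int = 3, max_len: int = 20):
    """Keep the `beam_size` most probable partial captions instead of the single greedy one.

    `step(tokens)` is assumed to return a 1-D tensor of log-probabilities over the
    vocabulary for the next word, given the tokens generated so far.
    """
    beams = [([start_token], 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == end_token:
                candidates.append((tokens, score))  # finished captions are carried over
                continue
            log_probs = step(tokens)
            top_lp, top_ix = log_probs.topk(beam_size)
            for lp, ix in zip(top_lp.tolist(), top_ix.tolist()):
                candidates.append((tokens + [ix], score + lp))
        # Keep only the best `beam_size` hypotheses for the next step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return max(beams, key=lambda c: c[1])[0]

Greedy decoding is the special case beam_size = 1; widening the beam trades inference time for caption likelihood.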

About

PyTorch implementation of Show and Tell, adapted for video


Languages

Python 86.5% · Cython 12.2% · Shell 1.2% · Makefile 0.2%