This is the official code release for the R3-Transformer proposed in *Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language*.
All dependencies are included in our Docker container. First, install the latest version of Docker, then pull our Docker image:
```
docker pull hassanhub/vid_cap:latest
```
Then run the container:

```
docker run --gpus all --name r3_container -it -v /home/
```
Note: This image already includes CUDA-related drivers and dependencies.
Alternatively, you can create your own environment and make sure the following dependencies are installed:
- Python 3.7/3.8
- TensorFlow 2.3
- CUDA 10.1
- NVIDIA Driver v440.100
- cuDNN 7.6.5
- opencv-python
- h5py
- transformers
- matplotlib
- scikit-image
- nvidia-ml-py3
- decord
- pandas
- tensorcore.dataflow
To speed up data feeding, we use a multi-chunk HDF5 format (a minimal chunk-reading sketch is shown below). There are two options for preparing the data for training/evaluation.
Download features pre-extracted with SlowFast-50-8x8 (pre-trained on Kinetics-400) from this link:
- Parts 0-10 (coming soon...)
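Each chunk is an ordinary HDF5 file, so it can be inspected with h5py once downloaded. The sketch below is only an illustration of reading such a chunk; the file name (`features_part0.h5`) and the per-video `features`/`caption` keys are assumptions, not the exact layout of the released files.

```python
import h5py
import numpy as np

# Illustrative layout assumption: one HDF5 group per video, holding a
# "features" array and a "caption" string. Adjust the path and keys
# to match the actual chunks.
chunk_path = "features_part0.h5"

with h5py.File(chunk_path, "r") as f:
    for video_id in list(f.keys())[:3]:       # peek at the first few videos
        grp = f[video_id]
        feats = np.asarray(grp["features"])   # e.g. (num_clips, feat_dim)
        caption = grp["caption"][()]
        if isinstance(caption, bytes):
            caption = caption.decode("utf-8")
        print(video_id, feats.shape, caption[:60])
```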
Alternatively, you can follow these steps to extract customized features using your own visual backbone:
- Download YouCook II
- Download ActivityNet Captions
- Pre-process raw video files using this script (a minimal frame-sampling sketch is given after this list)
- Extract visual features with your own visual backbone or our pre-trained SlowFast-50-8x8 using this script (see the feature-extraction sketch below)
- Store features and captions in a multi-chunk HDF5 format using this script (see the chunk-writing sketch below)
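The pre-processing script itself is linked above; as a rough sketch of that step (not the repo's exact settings), the listed decord and opencv-python dependencies can be used to decode a video, sample frames at a fixed rate, and resize them. The sampling rate, output size, function name, and input file below are illustrative assumptions.

```python
import numpy as np
import cv2
from decord import VideoReader, cpu

def sample_frames(video_path, fps_out=8, size=(256, 256)):
    """Decode a video, sample frames at roughly `fps_out` fps, and resize them.
    The rate and size are illustrative, not the repo's actual settings."""
    vr = VideoReader(video_path, ctx=cpu(0))
    step = max(int(round(vr.get_avg_fps() / fps_out)), 1)
    indices = np.arange(0, len(vr), step)
    frames = vr.get_batch(indices).asnumpy()              # (T, H, W, 3) uint8
    return np.stack([cv2.resize(f, size) for f in frames])

frames = sample_frames("some_video.mp4")   # hypothetical input file
print(frames.shape)
```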
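Feature extraction depends on the chosen backbone, so the sketch below only shows the surrounding logic: grouping sampled frames into fixed-length clips and encoding each clip into a feature vector. The `backbone` callable and the clip length are placeholders; in our setup this role is played by SlowFast-50-8x8.

```python
import numpy as np

def extract_clip_features(frames, backbone, clip_len=32):
    """Split frames into non-overlapping clips and encode each with `backbone`.
    `backbone` is a placeholder for any model mapping a (clip_len, H, W, 3)
    clip to a (feat_dim,) vector, e.g. SlowFast-50-8x8."""
    clips = [frames[i:i + clip_len]
             for i in range(0, len(frames) - clip_len + 1, clip_len)]
    return np.stack([backbone(clip) for clip in clips])  # (num_clips, feat_dim)

# Toy stand-in: random frames and a mean-pooling "backbone" with 512-d output.
frames = np.random.randint(0, 256, size=(64, 224, 224, 3), dtype=np.uint8)
toy_backbone = lambda clip: np.resize(clip.mean(axis=(0, 1, 2)), 512)
feats = extract_clip_features(frames, toy_backbone)
print(feats.shape)   # (2, 512)
```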
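Finally, a hedged illustration of the multi-chunk idea: write the features and caption for a bounded number of videos into each HDF5 file, so no single file grows too large and the data pipeline can read chunks independently. The group/dataset names, chunk size, and file prefix are assumptions for this example, not the layout produced by the repo's script.

```python
import h5py
import numpy as np

def write_chunks(samples, videos_per_chunk=500, prefix="features_part"):
    """Write (video_id, features, caption) tuples into multiple HDF5 chunks.
    One group per video; the "features"/"caption" keys are illustrative."""
    for start in range(0, len(samples), videos_per_chunk):
        chunk = samples[start:start + videos_per_chunk]
        path = f"{prefix}{start // videos_per_chunk}.h5"
        with h5py.File(path, "w") as f:
            for video_id, feats, caption in chunk:
                grp = f.create_group(video_id)
                grp.create_dataset("features", data=feats, compression="gzip")
                grp.create_dataset("caption", data=caption.encode("utf-8"))
        print("wrote", path, "with", len(chunk), "videos")

# Toy example: three videos with random features and dummy captions.
samples = [(f"video_{i:04d}",
            np.random.rand(10, 512).astype("float32"),
            "someone slices an onion") for i in range(3)]
write_chunks(samples, videos_per_chunk=2)
```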