

Joint Event Detection and Description in Continuous Video Streams

Code released by Huijuan Xu (Boston University).

Introduction

We present the Joint Event Detection and Description Network (JEDDi-Net) that solves the dense captioning task in an end-to-end fashion. Our model continuously encodes the input video stream with three-dimensional convolutional layers, proposes variable-length temporal events based on pooled features, and transcribes the event proposals into captions with the consideration of visual and language context.
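
As a rough illustration of the data flow described above, here is a minimal Python sketch (this is not the actual Caffe network definition; c3d_encoder, spn, and captioner are hypothetical placeholders for the three trained components, and features is assumed to be a NumPy-like array):

    def jeddi_net_forward(frames, c3d_encoder, spn, captioner):
        # 1. Encode the continuous frame stream with 3D convolutional layers.
        features = c3d_encoder(frames)              # NumPy-like array, shape (T, D)
        # 2. Propose variable-length temporal events from pooled features.
        proposals = spn(features)                   # list of (start, end, score)
        # 3. Transcribe each proposal into a caption, conditioning on the
        #    visual context (whole-video feature) and the language context
        #    (captions decoded so far).
        captions = []
        for start, end, _score in proposals:
            event_feature = features[start:end].mean(axis=0)   # pooled event feature
            visual_context = features.mean(axis=0)             # whole-video context
            captions.append(captioner(event_feature, visual_context, captions))
        return proposals, captions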

License

JEDDi-Net is released under the MIT License (refer to the LICENSE file for details).

Citing JEDDi-Net

If you find JEDDi-Net useful in your research, please consider citing:

@inproceedings{xu2019joint,
  title={Joint Event Detection and Description in Continuous Video Streams},
  author={Xu, Huijuan and Li, Boyang and Ramanishka, Vasili and Sigal, Leonid and Saenko, Kate},
  booktitle={2019 IEEE Winter Conference on Applications of Computer Vision (WACV)},
  year={2019}
}

Contents

  1. Installation
  2. Preparation
  3. Training
  4. Testing

Installation:

  1. Clone the JEDDi-Net repository.

    git clone --recursive git@github.com:VisionLearningGroup/JEDDi-Net.git
  2. Build Caffe3d with pycaffe (see: Caffe installation instructions).

    Note: Caffe must be built with Python support!

    cd ./caffe3d

    # If you have all of the requirements installed and your Makefile.config in place, simply run:
    make -j8 && make pycaffe
  3. Build the JEDDi-Net lib folder.

    cd ./lib    
    make

Preparation:

  1. Download the ground-truth annotations and videos of the ActivityNet Captions dataset.

  2. Extract frames from the downloaded videos at 25 fps (a sketch of one way to do this is shown after this list).

  3. Generate the pickle data for training and testing the JEDDi-Net model.

    cd ./preprocess
    # generate training data
    python generate_train_roidb_sorted.py
    # generate validation data
    python generate_val_roidb.py  
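
One possible way to do the frame extraction in step 2, assuming ffmpeg is installed (the directory layout and frame-naming pattern below are assumptions; check them against the preprocessing scripts in ./preprocess before use):

    import subprocess
    from pathlib import Path

    VIDEO_DIR = Path("videos")   # hypothetical: the downloaded ActivityNet videos
    FRAME_DIR = Path("frames")   # hypothetical: output root for extracted frames

    def extract_frames(video_path, out_dir, fps=25):
        """Decode a video into numbered JPEG frames at a fixed frame rate."""
        out_dir.mkdir(parents=True, exist_ok=True)
        subprocess.run([
            "ffmpeg", "-i", str(video_path),
            "-r", str(fps),                      # resample to 25 fps
            "-q:v", "2",                         # high-quality JPEGs
            str(out_dir / "image_%05d.jpg"),
        ], check=True)

    for video in sorted(VIDEO_DIR.glob("*.mp4")):
        extract_frames(video, FRAME_DIR / video.stem)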

Training:

  1. Download the separately trained segment proposal network (SPN) and captioning models to ./pretrain/.

  2. In the JEDDi-Net root folder, run:

    bash ./experiments/denseCap_jeddiNet_end2end/script_train.sh

Testing:

  1. Download a sample JEDDi-Net model to ./snapshot/.

    A JEDDi-Net model trained on the ActivityNet Captions dataset is provided: caffemodel.

    The provided JEDDi-Net model achieves a METEOR score of ~8.58% on the validation set.

  2. In the JEDDi-Net root folder, generate the prediction log file on the validation set:

    bash ./experiments/denseCap_jeddiNet_end2end/test/script_test.sh 
  3. Generate the results.json file from the prediction log file (a Python sketch of this conversion is shown after this list).

    cd ./experiments/denseCap_jeddiNet_end2end/test/
    bash bash.sh
  4. Follow the evaluation code to obtain the evaluation results.
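
The conversion in step 3 is handled by bash.sh; for reference, here is a minimal Python sketch that produces a results.json in the format expected by the ActivityNet Captions evaluation (the log-line format and file names below are assumptions, not the actual output of script_test.sh):

    import json
    import re
    from collections import defaultdict

    # Hypothetical log-line format: "<video_id>\t<start>\t<end>\t<sentence>"
    LINE_RE = re.compile(r"^(v_\S+)\t([\d.]+)\t([\d.]+)\t(.+)$")

    results = defaultdict(list)
    with open("test_log.txt") as f:              # hypothetical log file name
        for line in f:
            match = LINE_RE.match(line.rstrip("\n"))
            if match:
                vid, start, end, sentence = match.groups()
                results[vid].append({
                    "sentence": sentence,
                    "timestamp": [float(start), float(end)],
                })

    # Top-level schema used by the official ActivityNet Captions evaluator.
    with open("results.json", "w") as f:
        json.dump({
            "version": "VERSION 1.0",
            "results": results,
            "external_data": {"used": False, "details": ""},
        }, f)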
