jeffhernandez1995 / video-loading

Video dataset loading for PyTorch


This code organizes a standard video classification dataset into a format that is easy to use with PyTorch. We try several methods of loading the Kinetics-400 dataset (241255 videos in train, 19881 videos in validation) and compare them in terms of speed and resulting size. These methods are:

  • The standard method, which stores the videos as they are, as mp4 files, and loads them with torchvision's VideoClips class. We use this method as a baseline.
  • Using torchvision.io to load the videos. This uses torchvision's VideoReader class with the PyAV backend (I was unable to make the new API work; see here). This just requires the videos to be stored as mp4 files.
  • Using decord to load the videos. This library is reported to be faster than PyTorch's VideoReader. This just requires the videos to be stored as mp4 files (a minimal loading sketch follows this list).
  • Using the Python bindings of ffmpeg to load the videos. This just requires the videos to be stored as mp4 files.
  • Using FFCV, we store the videos as sequences of JPEG images. This requires several preprocessing steps: (1) we extract the frames from the videos, (2) we resize the frames to a maximum size, (3) we encode the frames as JPEG images, and (4) we store the frames in a single FFCV file.
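
For the mp4-based methods, loading a clip reduces to decoding T frames spaced t apart. A minimal sketch with decord (the file path is a placeholder):

```python
import numpy as np
from decord import VideoReader, cpu

T, t = 16, 4  # frames the model sees, frames skipped in between

def load_clip(path, start=0):
    """Decode T frames spaced t apart from an mp4 file."""
    vr = VideoReader(path, ctx=cpu(0))
    indices = np.clip(start + np.arange(T) * t, 0, len(vr) - 1)
    return vr.get_batch(indices).asnumpy()  # (T, H, W, 3) uint8 array

clip = load_clip("/path/to/video.mp4")
```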

Since videos are mostly redundant, people usually skip frames when feeding them to a neural network. This is usually denoted as T×t, where T is the number of frames the model sees and t is the number of frames skipped in between. We use T=16 and t=4, which is the standard for Kinetics-400. Similarly, for evaluation people usually try to cover the full video, so they split it into several crops (these can be overlapping or non-overlapping) and average the predictions. Crops are denoted as s×τ, where s is the number of spatial crops and τ is the number of temporal crops. We use 5 temporal crops and 1 spatial crop; there is no standard here, and this varies from paper to paper.
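
The index arithmetic behind T×t sampling and the τ temporal crops is worth spelling out. A minimal sketch (the helper names are ours, not from the scripts):

```python
import numpy as np

def train_indices(num_frames, T=16, t=4):
    """One random T×t clip: T frames spaced t apart, random start."""
    span = (T - 1) * t + 1
    start = np.random.randint(max(num_frames - span + 1, 1))
    return np.clip(start + np.arange(T) * t, 0, num_frames - 1)

def eval_indices(num_frames, T=16, t=4, tau=5):
    """tau evenly spaced temporal crops covering the whole video (s=1)."""
    span = (T - 1) * t + 1
    starts = np.linspace(0, max(num_frames - span, 0), tau).astype(int)
    return [np.clip(s + np.arange(T) * t, 0, num_frames - 1) for s in starts]
```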

Writing the FFCV dataset

We use the following commands to write the FFCV files:

```bash
python write_video.py --cfg.dataset k400 --cfg.split training --cfg.data_dir /scratch/jeh16/datasets/k400/ --cfg.write_path /scratch/jeh16/ffcv/k400/k400_train_16x4.ffcv --cfg.min_resolution 320 --cfg.quality 90 --cfg.max_frames 16 --cfg.frame_skip 4 --cfg.num_workers 50
python write_video.py --cfg.dataset k400 --cfg.split validation --cfg.data_dir /scratch/jeh16/datasets/k400/ --cfg.write_path /scratch/jeh16/ffcv/k400/k400_val_16x4_1x5.ffcv --cfg.min_resolution 320 --cfg.quality 90 --cfg.max_frames 16 --cfg.frame_skip 4 --cfg.num_workers 50
```
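
write_video.py implements the preprocessing described above. As a rough, hypothetical sketch of the idea (decode a 16×4 clip, let FFCV resize and JPEG-encode each frame, write one sample per video), it could look like the following; the per-frame RGBImageField schema and the ClipDataset helper are assumptions, not necessarily the exact layout the script uses:

```python
import numpy as np
from decord import VideoReader, cpu
from ffcv.writer import DatasetWriter
from ffcv.fields import RGBImageField, IntField

T, t = 16, 4

class ClipDataset:
    """Indexed dataset yielding T decoded frames plus a label per video."""
    def __init__(self, paths, labels):
        self.paths, self.labels = paths, labels

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, i):
        vr = VideoReader(self.paths[i], ctx=cpu(0))
        idx = np.clip(np.arange(T) * t, 0, len(vr) - 1)
        frames = vr.get_batch(idx).asnumpy()  # (T, H, W, 3)
        return (*frames, self.labels[i])

paths, labels = [...], [...]  # placeholders: fill in from the k400 annotations

# One JPEG-encoded image field per frame; FFCV handles resizing and encoding.
fields = {f"frame_{i}": RGBImageField(write_mode="jpg", max_resolution=320,
                                      jpeg_quality=90) for i in range(T)}
fields["label"] = IntField()

writer = DatasetWriter("k400_train_16x4.ffcv", fields, num_workers=50)
writer.from_indexed_dataset(ClipDataset(paths, labels))
```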

Video loading

We simulate the video loading process by loading the videos and skipping frames. We always perform 10 measurements and report the average (a sketch of the timing loop follows the list below). We perform the standard augmentations RandomResizedCrop -> RandomHorizontalFlip -> Normalize. We use the following parameters: T=16, t=4, s=1, τ=5. We use a batch size of 32 and set the number of workers to 40. We use a single NVIDIA A40 GPU.

  • Standard method (torchvision VideoClips): `python kinetics_legacy.py`
  • torchvision.io (PyAV): `python kinetics_pyav.py`
  • decord: `python kinetics_decord.py`
  • ffmpeg: `python kinetics_ffmpeg.py`
  • FFCV: `python kinetics_ffcv.py`
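
Each script builds its loader differently, but the measurement is the same throughput loop. A minimal sketch (the `make_loader` call stands in for whichever script is being benchmarked):

```python
import time

def benchmark(loader, num_runs=10):
    """Average videos/second over num_runs full passes of the loader."""
    rates = []
    for _ in range(num_runs):
        start, n = time.time(), 0
        for clips, labels in loader:
            n += clips.shape[0]  # clips: (B, T, 3, H, W) after the augmentations
        rates.append(n / (time.time() - start))
    return sum(rates) / len(rates)

# loader = make_loader(...)  # hypothetical: built by the chosen script
# print(f"{benchmark(loader):.2f} videos/second")
```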

Speed Results

| Library | Train size (GB) | Val size (GB) | Train videos | Val videos | Frames | Skip | Videos/second |
| --- | --- | --- | --- | --- | --- | --- | --- |
| VideoClips (torchvision) | 348 | 29 | 241255 | 19881 | 16 | 4 | 276.55 |
| torchvision.io (PyAV) | 348 | 29 | 241255 | 19881 | 16 | 4 | 42.36 |
| decord | 348 | 29 | 241255 | 19881 | 16 | 4 | 70.75 |
| ffmpeg | 348 | 29 | 241255 | 19881 | 16 | 4 | 26.91 |
| FFCV | 946 | 44 | 241255 | 19881 | 16 | 4 | 564.03 |

Model evaluations using FFCV:

The models on the HuggingFace Hub that I found are:

  • VideoMAE in the following configurations: Small, Base, Large, Huge. Trained using 16 frames and skipping 4 (an evaluation sketch follows this list).
  • My own ViC-MAE in the following configurations: Base, Large. See the ViC-MAE repository for more details. Trained using 16 frames and skipping 4.
  • TimeSformer in the following configurations: Base, HR. Trained using 8 frames and skipping 4, and 16 frames and skipping 4, respectively.
  • ViViT in the following configuration: Base. Trained using 32 frames and skipping 2.
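
For reference, a minimal sketch of the 1×5-view evaluation with a Hub checkpoint (shown here with the VideoMAE Base Kinetics model as an example; the views tensor is assumed to be already sampled and normalized):

```python
import torch
from transformers import VideoMAEForVideoClassification

model = VideoMAEForVideoClassification.from_pretrained(
    "MCG-NJU/videomae-base-finetuned-kinetics").eval()

@torch.no_grad()
def predict(views):
    """Average logits over the tau temporal views of one video."""
    # views: (tau, T, 3, 224, 224) float tensor, already normalized
    logits = model(pixel_values=views).logits  # (tau, 400)
    return logits.mean(dim=0).argmax().item()
```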

The results we obtain are:

| Name | Arch | T×t | s×τ views (reported) | Accuracy (reported) | s×τ views (ours) | Accuracy (ours) |
| --- | --- | --- | --- | --- | --- | --- |
| VideoMAE | ViT/S-16 | 16×4 | 3×5 | 79.0 | 1×5 | 54.7 |
| VideoMAE | ViT/B-16 | 16×4 | 3×5 | 81.5 | 1×5 | 77.7 |
| VideoMAE | ViT/L-14 | 16×4 | 3×5 | 85.2 | 1×5 | 81.9 |
| VideoMAE | ViT/H-14 | 16×4 | 3×5 | 86.6 | 1×5 | 83.0 |
| ViC-MAE | ViT/B-16 | 16×4 | 3×7 | 80.8 | 1×5 | 77.3 |
| ViC-MAE | ViT/L-14 | 16×4 | 3×7 | 86.8 | 1×5 | 83.1 |
| TimeSformer | Base | 8×4 | 3×1 | 79.1 | 1×5 | 73.9 |
| TimeSformer | HR | 16×4 | 3×1 | 81.8 | 1×5 | 64.3 |
| ViViT | ViViT/B | 32×2 | 1×4 | 79.9 | 1×5 | 66.5 |


License: Apache License 2.0

