GuangxueWang / vtn

Video Transformer Network



VTN - PyTorch

Implementation of Video Transformer Network, a simple framework for video classification that pairs a Vision Transformer spatial backbone with an additional temporal transformer.

Spatial Backbone:

Vision Transformer - loaded via timm; can be swapped for any image classifier (see the sketch below)
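
As a rough illustration of how the spatial backbone extracts per-frame features, the sketch below builds a timm model and folds the temporal dimension into the batch. The model name and the reshape flow are illustrative assumptions, not the exact code in this repo.

import timm
import torch

# Any timm classifier can serve as the spatial backbone; num_classes=0
# makes the model return pooled features instead of class logits.
backbone = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=0)

video = torch.rand(1, 16, 3, 224, 224)             # (batch, frames, channels, H, W)
b, f, c, h, w = video.shape
features = backbone(video.view(b * f, c, h, w))    # (batch * frames, embed_dim)
features = features.view(b, f, -1)                 # (batch, frames, embed_dim)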

Temporal Backbone:

  1. Longformer - the transformer used in the original paper, sample config
  2. Linformer - another linear-complexity transformer, used for my own research, sample config
  3. Transformer - a plain full transformer encoder; with the right configuration, the model can serve as an implementation of Is Space-Time Attention All You Need for Video Understanding? (see the sketch after this list), sample config
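
For the plain-transformer option, here is a minimal sketch of the idea: a learned classification token is prepended to the per-frame embeddings, a standard encoder attends over time, and an MLP head classifies from the token. Dimensions, names, and the omission of temporal positional embeddings are simplifications, not this repo's exact implementation.

import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Illustrative temporal head: CLS token + standard transformer encoder.
    (Temporal positional embeddings omitted for brevity.)"""
    def __init__(self, embed_dim=768, num_classes=400, depth=3, heads=12):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, frame_features):              # (batch, frames, embed_dim)
        cls = self.cls_token.expand(frame_features.size(0), -1, -1)
        x = torch.cat([cls, frame_features], dim=1)  # prepend CLS token
        x = self.encoder(x)
        return self.head(x[:, 0])                    # classify from the CLS token

logits = TemporalEncoder()(torch.rand(1, 16, 768))   # (1, 400)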

Dataset implementations:

Basic dataset loaders are provided for:

  1. Kinetics-400 (can be used for any Kinetics-xxx dataset)
  2. Something-Something-V2
  3. UCF-101
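
A typical way to feed one of these datasets to the model is via a standard DataLoader. The class name and constructor arguments below are assumptions for illustration; check the dataset module and sample configs for the actual API.

from torch.utils.data import DataLoader
# Hypothetical names: the real dataset class and its arguments may differ.
from dataset import Kinetics400

train_set = Kinetics400(root='data/kinetics400', split='train', num_frames=16)
loader = DataLoader(train_set, batch_size=8, shuffle=True, num_workers=4)

clips, labels = next(iter(loader))   # clips: (8, 16, 3, 224, 224), labels: (8,)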

Usage

import torch
from utils import load_yaml
from model import VTN

cfg = load_yaml('configs/vtn.yaml')

model = VTN(**vars(cfg))

video = torch.rand(1, 16, 3, 224, 224)  # (batch, frames, channels, height, width)

preds = model(video)  # class logits, shape (1, 400)

Parameters are self-explanatory in the config file.
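
If you want to fine-tune rather than just run inference, a minimal training step could look like the sketch below. It reuses the model from the Usage snippet; the optimizer, learning rate, and random batch are placeholders, not the paper's training schedule.

import torch
import torch.nn as nn

# model built as in the Usage snippet above
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
clips = torch.rand(8, 16, 3, 224, 224)   # a batch of 8 clips
labels = torch.randint(0, 400, (8,))     # Kinetics-400 class indices

optimizer.zero_grad()
loss = criterion(model(clips), labels)
loss.backward()
optimizer.step()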

Results

| Model | Top-1 | Top-5 | Weights |
| --- | --- | --- | --- |
| Longformer-VTN | 78.9% | 93.7% | taken from |
| Transformer-VTN | 78.0% | 93.7% | taken from |
| Linformer-VTN | 75.6% | 92.6% | link |
| Linformer-VTN-MIIL-21k | 76.8% | 93.4% | link |
| Linformer-VTN-21k | 77.2% | 93.4% | |

License: GNU General Public License v2.0

