facebookresearch / pytorchvideo

A deep learning library for video understanding research.

Home Page: https://pytorchvideo.org/

RoI head for Vision Transformer

yuxin212 opened this issue

🚀 Feature

Please consider adding an RoI head for Vision Transformer so that Vision Transformer backbones can be used for action detection.

Motivation

MViT performs better on the AVA dataset than conv-net-based methods such as Slow/ResNet, but currently RoI heads are implemented only for Slow and SlowFast.

Pitch

A function/class, similar to the ResNet RoI head, that creates the RoI head for Vision Transformer.

Took a brief look at this.

I think we could use the RoI code found here

There are some differences but I think it's a good starting point. Curious to know your thoughts!