RoI head for Vision Transformer
yuxin212 opened this issue · comments
🚀 Feature
Please consider adding an RoI head for Vision Transformers so that they can be used for action detection.
Motivation
MViT performs better on the AVA dataset than conv-net-based methods such as Slow/ResNet, but RoI heads are currently implemented only for Slow and SlowFast.
Pitch
A function/class, similar to the ResNet RoI head, that creates an RoI head for Vision Transformers.
Took a brief look at this.
I think we could use the RoI code found here.
There are some differences but I think it's a good starting point. Curious to know your thoughts!