Vision Transformers

A PyTorch implementation of the Vision Transformer (ViT), a model that achieves state-of-the-art results in image classification using transformer-style encoders. An associated blog article accompanies this repository.

Features

Currently supported:

Vanilla ViT
Hybrid ViT (with support for BiT-style resnets)
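
To make the vanilla variant concrete, here is a minimal sketch of the idea: split the image into fixed-size patches, project each patch to a token, prepend a class token, add learned position embeddings, and run a standard transformer encoder. This is not this repository's API. The class name `TinyViT`, its argument names, and the small default sizes are all hypothetical and chosen only for illustration. It assumes a PyTorch version that supports `batch_first` and `norm_first` in `nn.TransformerEncoderLayer`.

```python
import torch
from torch import nn


class TinyViT(nn.Module):
    """Illustrative vanilla ViT. Names and defaults are hypothetical, not this repo's API."""

    def __init__(self, image_size=32, patch_size=16, in_channels=3,
                 dim=64, depth=2, heads=4, mlp_dim=128, num_classes=10):
        super().__init__()
        assert image_size % patch_size == 0, "image size must be divisible by patch size"
        num_patches = (image_size // patch_size) ** 2

        # A strided convolution is equivalent to flattening each patch
        # and applying one shared linear projection.
        self.patch_embed = nn.Conv2d(in_channels, dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

        # Pre-norm encoder layers with GELU activations.
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=mlp_dim,
            dropout=0.0, activation="gelu", batch_first=True, norm_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):
        x = self.patch_embed(images)              # (B, dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)          # (B, num_patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(self.norm(x[:, 0]))      # classify from the class token


model = TinyViT()
logits = model(torch.randn(2, 3, 32, 32))
print(logits.shape)
```

With these defaults a 32x32 image yields four 16x16 patches plus one class token, so the encoder sees five tokens per image. A hybrid variant would, roughly, replace the strided convolution with the feature map of a ResNet backbone before the same encoder.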

To Do:

Hybrid ViT (with support for AxialResNets as backbone)
Full Axial-ViT
Training Script

Citations

@inproceedings{
    anonymous2021an,
    title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
    author={Anonymous},
    booktitle={Submitted to International Conference on Learning Representations},
    year={2021},
    url={https://openreview.net/forum?id=YicbFdNTTy},
    note={under review}
}

MIT License
