nooralahzadeh / ViT

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Home Page: https://openreview.net/forum?id=YicbFdNTTy

Vision Transformers

A PyTorch implementation of the Vision Transformer (ViT), a model that achieves state-of-the-art (SOTA) results in image classification using transformer-style encoders. An associated blog article accompanies this implementation.
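As a rough illustration of the "16x16 words" idea, the sketch below computes how many tokens the encoder would see when a square image is cut into non-overlapping square patches. The function name `vit_sequence_length` and its parameters are hypothetical and are not part of this repository. The extra class token follows the paper's description.

```python
def vit_sequence_length(image_size: int, patch_size: int, use_class_token: bool = True) -> int:
    """Number of tokens the transformer encoder sees for a square image.

    Hypothetical helper for illustration only; not part of this repository.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    # The paper prepends one learnable class token to the patch sequence.
    return num_patches + (1 if use_class_token else 0)


# A 224x224 image with 16x16 patches: 14 * 14 = 196 patches, plus 1 class token.
print(vit_sequence_length(224, 16))  # 197
```

Smaller patches lengthen the sequence quadratically, which is why patch size is a main cost knob for this family of models.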

Features

Currently supported:

  • Vanilla ViT
  • Hybrid ViT (with support for BiT-style ResNets)
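The difference between the two supported variants can be sketched as follows. This is a minimal sketch under stated assumptions, not this repository's API: the class names `PatchEmbedding` and `HybridEmbedding`, the embedding width of 768, and the toy backbone are all hypothetical. The vanilla variant projects raw pixel patches, while the hybrid variant projects 1x1 "patches" taken from a CNN feature map.

```python
import torch
from torch import nn


class PatchEmbedding(nn.Module):
    """Vanilla ViT input: linearly projected, non-overlapping pixel patches."""

    def __init__(self, in_channels: int = 3, patch_size: int = 16, dim: int = 768):
        super().__init__()
        # A convolution with kernel_size == stride cuts non-overlapping patches
        # and applies one shared linear projection to each of them.
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                     # (B, dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, dim)


class HybridEmbedding(nn.Module):
    """Hybrid ViT input: 1x1 patches of a CNN feature map instead of raw pixels."""

    def __init__(self, backbone: nn.Module, feature_channels: int, dim: int = 768):
        super().__init__()
        self.backbone = backbone
        self.proj = nn.Conv2d(feature_channels, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(self.backbone(x))      # (B, dim, H', W')
        return x.flatten(2).transpose(1, 2)  # (B, H' * W', dim)


# Toy stand-in backbone (an assumption, NOT a BiT ResNet): total stride 16.
toy_backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=4, padding=3),
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, stride=4, padding=1),
)

images = torch.randn(2, 3, 224, 224)
vanilla_tokens = PatchEmbedding()(images)
hybrid_tokens = HybridEmbedding(toy_backbone, feature_channels=64)(images)
print(vanilla_tokens.shape, hybrid_tokens.shape)
```

With a total backbone stride of 16, both paths yield a 14x14 grid of tokens for a 224x224 input, so the same transformer encoder could consume either sequence.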

To Do:

  • Hybrid ViT (with support for AxialResNets as the backbone)
  • Full Axial-ViT
  • Training Script

Citations

@inproceedings{
    anonymous2021an,
    title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
    author={Anonymous},
    booktitle={Submitted to International Conference on Learning Representations},
    year={2021},
    url={https://openreview.net/forum?id=YicbFdNTTy},
    note={under review}
}

About

License: MIT License


Languages

Language: Python 100.0%