Source: Google Blog
The main Transformer encoder from "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"
My Model on Hugging Face
I trained my ViT on the Food-101 dataset, and the deployed model can be used from here.