Vision Transformer (ViT)

Tensorflow implementation of the Vision Transformer (ViT) presented in An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, where the authors show that Transformers applied directly to image patches and pre-trained on large datasets work really well on image classification.

Install dependencies

Create a Python 3 virtual environment and activate it:

virtualenv -p python3 venv
source ./venv/bin/activate

Next, install the required dependencies:

pip install -r requirements.txt

Train model

Start the model training by running:

python train.py --logdir path/to/log/dir

To track metrics, start Tensorboard

tensorboard --logdir path/to/log/dir

and then go to localhost:6006.

Citation

@inproceedings{
    anonymous2021an,
    title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
    author={Anonymous},
    booktitle={Submitted to International Conference on Learning Representations},
    year={2021},
    url={https://openreview.net/forum?id=YicbFdNTTy},
    note={under review}
}

About

Tensorflow implementation of the Vision Transformer (An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale)

https://openreview.net/pdf?id=YicbFdNTTy

Languages

Language:Python 100.0%