
ViT-Model

A reimplementation of the ViT vision model, based on the Transformer architecture originally designed for text-based tasks.

This is my implementation of the ViT model for practicing PyTorch.

Background

ViT is a computer vision model built on the attention mechanism and the well-known Transformer architecture. It splits an image into fixed-size patches and treats them as a sequence of tokens, so the model can make use of contextual information such as each patch's position and, during training, the label assigned to the image.
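To make the patch idea concrete, here is a minimal sketch of how an image can be turned into a sequence of patch embeddings with a single Conv2d whose kernel size and stride equal the patch size. This mirrors the role of patch_embedding.py; the exact code in this repo may differ.

import torch
from torch import nn

# Illustrative only: embed a 224x224 RGB image as 196 patch tokens of dimension 768.
patcher = nn.Conv2d(in_channels=3, out_channels=768, kernel_size=16, stride=16)

image = torch.randn(1, 3, 224, 224)           # (batch, channels, height, width)
patches = patcher(image)                      # (1, 768, 14, 14)
tokens = patches.flatten(2).permute(0, 2, 1)  # (1, 196, 768): 14*14 = 196 patch tokens
print(tokens.shape)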

The original research paper: Research Paper

The official JAX repository is here.

A TensorFlow 2 translation also exists here, created by research scientist Junho Kim!

Structure of Source

.
├── ViT.py
├── __pycache__
│   ├── ViT.cpython-310.pyc
│   ├── data_setup.cpython-310.pyc
│   ├── data_setup.cpython-38.pyc
│   ├── engine.cpython-310.pyc
│   ├── engine.cpython-38.pyc
│   ├── helper_functions.cpython-310.pyc
│   ├── helper_functions.cpython-38.pyc
│   ├── main.cpython-310.pyc
│   ├── mlp.cpython-310.pyc
│   ├── msa.cpython-310.pyc
│   ├── patch_embedding.cpython-310.pyc
│   ├── path_embedding.cpython-310.pyc
│   └── transformer_encoder.cpython-310.pyc
├── data_setup.py
├── engine.py
├── helper_functions.py
├── main.py
├── mlp.py
├── msa.py
├── patch_embedding.py
├── train.py
└── transformer_encoder.py
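The model is split into small modules: msa.py (multi-head self-attention), mlp.py (the MLP block), patch_embedding.py (the patcher), and transformer_encoder.py (one encoder block), which ViT.py stacks into the full model. As a rough, hedged sketch of how such an encoder block fits together (class and attribute names here are illustrative, not the repo's exact implementation):

import torch
from torch import nn

# Illustrative pre-norm ViT encoder block (not the repo's exact code).
class EncoderBlockSketch(nn.Module):
    def __init__(self, embedding_dim=768, num_heads=12, mlp_size=3072, mlp_dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(embedding_dim)
        self.msa = nn.MultiheadAttention(embedding_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embedding_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embedding_dim, mlp_size),
            nn.GELU(),
            nn.Dropout(mlp_dropout),
            nn.Linear(mlp_size, embedding_dim),
            nn.Dropout(mlp_dropout),
        )

    def forward(self, x):
        normed = self.norm1(x)
        attn_out, _ = self.msa(normed, normed, normed)  # self-attention + residual
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))                 # MLP block + residual
        return x

tokens = torch.randn(1, 197, 768)          # 196 patch tokens + 1 class token
print(EncoderBlockSketch()(tokens).shape)  # torch.Size([1, 197, 768])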

Installation

$ git clone https://github.com/hkt456/ViT-Model.git
$ cd ViT-Model

Usage

To get an overview of the structure of the multi-head attention (MSA) layer, the multi-layer perceptron (MLP) layer, the Transformer encoder, and the ViT model, run:

python3 source/main.py

For training and testing the model, you can use data_setup to download the necessary data and set up the dataloaders:

from data_setup import *
get_data() # Automatically downloads a sample image classification dataset
create_dataloaders() # Returns a tuple of (train_dataloader, test_dataloader, class_names), where class_names is a list of the target classes
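For instance, unpacking the returned tuple and inspecting one batch (a sketch; the exact tensor shapes depend on the batch size and transforms configured inside data_setup.py):

train_dataloader, test_dataloader, class_names = create_dataloaders()

# Peek at a single batch of images and labels.
images, labels = next(iter(train_dataloader))
print(images.shape, labels.shape)
print(class_names)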

Automatic training is not set up yet, but engine already provides a template for training, testing, and evaluating the performance of the model:

from engine import *

# train() returns a dictionary of training and testing loss as well as
# training and testing accuracy metrics. Each metric has a value in a list
# for each epoch, in the form:
#     {train_loss: [...],
#      train_acc: [...],
#      test_loss: [...],
#      test_acc: [...]}
# For example, if training for epochs=2:
#     {train_loss: [2.0616, 1.0537],
#      train_acc: [0.3945, 0.3945],
#      test_loss: [1.2641, 1.5706],
#      test_acc: [0.3400, 0.2973]}
results = train()

There are also functions for illustrating accuracy, loss, and other metrics. Feel free to check out helper_functions.py.
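If you just want a quick look at the curves without those helpers, the results dictionary returned by train() (captured above as results) can also be plotted directly with matplotlib. This is a minimal sketch, assuming the metric names shown above are the dictionary's string keys:

import matplotlib.pyplot as plt

# Assumes results = train() from above, with per-epoch lists under string keys.
epochs = range(len(results["train_loss"]))
plt.plot(epochs, results["train_loss"], label="train_loss")
plt.plot(epochs, results["test_loss"], label="test_loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()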

Parameters

  1. img_size: int = 224

    Default value is set to 224, defining the dimensions of a 224x224 image to be processed

  2. in_channels: int = 3

    Default value is set to 3, defining the number of channels of the input passed into the patch_embedding layer (the patcher, a Conv2d layer)

  3. patch_size: int = 16

    Default value is set to 16, defining the size of each patch, which is later turned into an embedding by the patch_embedding layer

  4. number_transformer_blocks: int = 12

    Default value is set to 12 to replicate the number of transformer blocks used in the ViT-Base architecture in the research paper

  5. embedding_dim: int = 768

    Default value is set to 768, defining the dimension of the embedding matrix used throughout different layers

  6. mlp_size: int = 3072

    Default value is set to 3072, defining the out_features for the nn.Linear layers inside the MLP layer

  7. num_heads: int = 12

    Default value is set to 12, defining the number of attention heads in each MSA (multi-head self-attention) layer

  8. attn_dropout: float = 0

    Default value is set to 0 like in the paper, defining the dropout parameter for MultiheadAttention

  9. mlp_dropout: float = 0.1

    Default value is set to 0.1 like in the paper, defining the dropout parameter for MLPBlock

  10. embedding_dropout: float = 0.1

    Default value is set to 0.1 like in the paper, defining the dropout applied to the patch and position embeddings

  11. num_classes: int = 1000

    Default value is set to 1000, defining the number of classes to classify
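Putting these defaults together, here is a minimal usage sketch. It assumes ViT.py exposes a class named ViT that accepts the keyword arguments listed above; both are assumptions, so check ViT.py for the actual names.

import torch
from ViT import ViT  # assumption: ViT.py defines a class named ViT

# Assumed constructor, mirroring the parameter list above.
model = ViT(
    img_size=224,
    in_channels=3,
    patch_size=16,
    number_transformer_blocks=12,
    embedding_dim=768,
    mlp_size=3072,
    num_heads=12,
    attn_dropout=0,
    mlp_dropout=0.1,
    embedding_dropout=0.1,
    num_classes=1000,
)

dummy = torch.randn(1, 3, 224, 224)  # one fake 224x224 RGB image
logits = model(dummy)                # expected shape: (1, 1000)
print(logits.shape)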

License

This project is licensed under the MIT License.
