Backpropagation is the procedure by which we update the model parameters by subtracting their gradients, so that the loss converges towards a minimum. The gradients are computed using the chain rule.
Our aim at each iteration of the training loop is to calculate the partial derivative of the loss function with respect to each weight (the parameters of the model).
After calculating these partial derivatives, we update the weights according to the following rule, where \alpha is the learning rate. It controls how fast or how slow the update is made in the direction of the minimum.
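As a minimal sketch of the update rule, each weight is decreased by the learning rate times its gradient (the numbers here are made up for illustration):

```python
# Hypothetical weights and their gradients dE/dw
weights = [0.15, 0.20, 0.25, 0.30]
grads   = [0.03, -0.01, 0.05, 0.02]
alpha   = 0.5  # learning rate

# w <- w - alpha * dE/dw, applied to every weight
weights = [w - alpha * g for w, g in zip(weights, grads)]
print(weights)
```

A larger \alpha takes bigger steps (faster but risks overshooting the minimum); a smaller one converges more slowly but more stably.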
We'll start with the following simple multi-layer perceptron, calculate its gradients, and update the weights in the Excel sheet step by step. The notation can be seen in the image itself. The Excel sheet can be found in the repository or here
Let's say we want to calculate the derivative of E_total with respect to w5. The first thing we need to do is map out all the paths from w5 to E_total.
Therefore
Similarly, for w6, w7 and w8:
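In the notation above, the chain-rule expansion for w5 can be written out as follows (a sketch assuming sigmoid activations and a squared-error loss, which is the usual setup for this worked example):

```latex
\frac{\partial E_{total}}{\partial w_5}
  = \frac{\partial E_{total}}{\partial out_{o1}}
    \cdot \frac{\partial out_{o1}}{\partial net_{o1}}
    \cdot \frac{\partial net_{o1}}{\partial w_5}
  = (out_{o1} - target_{o1}) \cdot out_{o1}(1 - out_{o1}) \cdot out_{h1}
```

The three factors correspond to the three edges on the only path from w5 to E_total: loss w.r.t. the output, the sigmoid derivative, and the input feeding w5.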
To calculate the gradients for the weights in the first layer, let us first look at the gradients of E_total w.r.t. out_h1 and out_h2.
Now, to calculate the gradient of w1, have a look at the following chain rule
Finally, with the help of the above derivation, we can calculate the gradients of all the weights in the first layer.
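One way of writing the first-layer gradient explicitly (again a sketch under the same sigmoid + squared-error assumptions): unlike w5, a hidden weight such as w1 influences E_total through both outputs, so its gradient sums the two contributions.

```latex
\frac{\partial E_{total}}{\partial w_1}
  = \frac{\partial E_{total}}{\partial out_{h1}}
    \cdot \frac{\partial out_{h1}}{\partial net_{h1}}
    \cdot \frac{\partial net_{h1}}{\partial w_1}
\qquad\text{where}\qquad
\frac{\partial E_{total}}{\partial out_{h1}}
  = \frac{\partial E_{o1}}{\partial out_{h1}}
  + \frac{\partial E_{o2}}{\partial out_{h1}}
```

With sigmoid hidden units, the middle factor is out_h1(1 - out_h1) and the last factor is simply the input i1.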
These gradients were used to calculate the final value of the loss over 200 iterations. The graphs of the loss for different learning rates are shown below.
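The whole procedure can be sketched as a small NumPy loop. The weight values below are illustrative, not necessarily those in the Excel sheet, and biases are omitted for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative 2-2-2 network: inputs, targets, and the two weight matrices
x = np.array([0.05, 0.10])           # inputs i1, i2
t = np.array([0.01, 0.99])           # targets
W1 = np.array([[0.15, 0.20],         # w1 w2  (hidden layer)
               [0.25, 0.30]])        # w3 w4
W2 = np.array([[0.40, 0.45],         # w5 w6  (output layer)
               [0.50, 0.55]])        # w7 w8
alpha = 0.5                          # learning rate

losses = []
for _ in range(200):
    # forward pass
    out_h = sigmoid(W1 @ x)
    out_o = sigmoid(W2 @ out_h)
    losses.append(0.5 * np.sum((t - out_o) ** 2))

    # backward pass: the chain-rule factors derived above
    delta_o = (out_o - t) * out_o * (1 - out_o)       # dE/dnet_o
    delta_h = (W2.T @ delta_o) * out_h * (1 - out_h)  # sums both output paths

    # update rule: w <- w - alpha * dE/dw
    W2 -= alpha * np.outer(delta_o, out_h)
    W1 -= alpha * np.outer(delta_h, x)

print(losses[0], losses[-1])
```

Running this, the loss at iteration 200 is well below the initial loss, mirroring the curves in the Excel sheet.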
Write a custom model architecture with fewer than 20k parameters that achieves a validation accuracy of 99.4% in no more than 20 epochs on the MNIST dataset.
- dataloader.py : Contains the train and test data loaders for the MNIST digits. Various augmentations can be added here as well
- models.py : Contains the different models I have experimented with. The last model, Net8, gives the best results, with a validation accuracy of 99.47%. It is my best model because it consistently gives an accuracy greater than 99.3% in the last few epochs and has a better first-epoch accuracy than any other model
- train.py : Training function
- test.py : Test function
- EVA7-Ass4.ipynb : Colab notebook for training on GPUs, which imports the necessary classes/functions from the above files.
- The model uses two max-pooling layers and applies global average pooling when the feature-map size reaches 3x3.
- In order to divide the model into 3 blocks, I have used padding in the first 2 layers to keep the image size constant.
- 3x3 kernels were used for convolution operations and 1x1 kernels for the transition blocks.
- Net2 and Net8 give the best results of 99.47% validation accuracy
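A hypothetical PyTorch sketch in the spirit of these points (not the actual Net8, whose definition is in models.py): three conv blocks separated by max-pooling, padding in the first block, 1x1 transitions, and GAP on the final 3x3 map, all well under 20k parameters.

```python
import torch.nn as nn

class SmallNet(nn.Module):
    """Illustrative <20k-parameter MNIST model; layer sizes are assumptions."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: padding keeps the 28x28 size constant
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.BatchNorm2d(8),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.BatchNorm2d(16),
            nn.MaxPool2d(2),                       # 28 -> 14
            nn.Conv2d(16, 8, 1),                   # 1x1 transition
            # Block 2
            nn.Conv2d(8, 16, 3), nn.ReLU(), nn.BatchNorm2d(16),   # 14 -> 12
            nn.Conv2d(16, 16, 3), nn.ReLU(), nn.BatchNorm2d(16),  # 12 -> 10
            nn.MaxPool2d(2),                       # 10 -> 5
            nn.Conv2d(16, 8, 1),                   # 1x1 transition
            # Block 3
            nn.Conv2d(8, 16, 3), nn.ReLU(), nn.BatchNorm2d(16),   # 5 -> 3
            nn.Conv2d(16, 10, 3, padding=1),       # 10 class channels, 3x3
            nn.AdaptiveAvgPool2d(1),               # GAP on the 3x3 map
        )

    def forward(self, x):
        return self.features(x).flatten(1)         # (N, 10) logits
```

With these choices the parameter count comes out around 8k, leaving room to widen the blocks if accuracy falls short.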