Homework 5 - Deep Learning Frameworks - Training an image classifier on the ImageNet 2012 dataset from random weights.

This is a graded homework.

Due just before the week 5 session

The goal

The goal of the homework is to train an image classification network on the ImageNet dataset to the Top 1 accuracy of 60% or higher.

We suggest that you use PyTorch or PyTorch Lightning.

The lab 6 materials ought to help you prepare for the homework.

The steps

The steps are roughly as follows:

Procure a virtual machine in AWS. We recommend a T4 GPU and 1 TB of space (e.g. g4dn.2xlarge). Use the Nvidia Deep Learning AMI so that the prerequisites are pre-installed for you. We recommend using the latest NVIDIA PyTorch container.
Download the ImageNet dataset to your VM. Please do register at image-net.org for all of your future needs. Given the slowness of download via this web site, however, we have downloaded a copy of ImageNet for you and will distribute it in class. (FYI - some students found this link helpful for downloading)
Prepare the dataset:

create train and val subdirectories and move the train and val tar files to their respective locations
untar both files and remove them as you no longer need them
Use the following shell script to process your val directory. It simply moves your validation set into the proper subfolders.
When you untarred the train file, it created a large number (1000) of tar files, one for each class. You will need to create a separate directory for each class, move the tar file there, untar the file and remove it. This should be a one liner shell script but we'll let you have fun with it!
Make sure that under the train and val folders, there is one directory per class and that the samples for that class are under that directory

Adapt the code we discuss in the labs to the training of imagenet. Make sure the number of classes and image sizes are correct. Make sure the transforms are correct.
Start training && observe progress!
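
Before starting a long run, it can help to confirm that the dataset layout from the preparation step is what the data loader expects. The following is a minimal sketch, not part of the course materials: the function name `check_imagefolder_layout` and the example root path are hypothetical, and it only checks that each split has the expected number of class directories and that none of them is empty.

```python
from pathlib import Path


def check_imagefolder_layout(root, expected_classes=1000):
    """Return a list of problems found under root (an empty list means OK)."""
    root = Path(root)
    if not root.is_dir():
        return [f"{root} is not a directory"]

    entries = list(root.iterdir())
    class_dirs = sorted(p for p in entries if p.is_dir())
    # Leftover tar files or loose images at the top level usually mean
    # that a preparation step was skipped.
    stray = sorted(p.name for p in entries if not p.is_dir())

    problems = []
    if len(class_dirs) != expected_classes:
        problems.append(
            f"expected {expected_classes} class directories, found {len(class_dirs)}"
        )
    if stray:
        problems.append(f"{len(stray)} stray files at top level, e.g. {stray[0]}")
    for class_dir in class_dirs:
        if not any(f.is_file() for f in class_dir.iterdir()):
            problems.append(f"class directory {class_dir.name} contains no files")
    return problems
```

Assuming the dataset lives under a directory such as `imagenet/` (a placeholder), calling `check_imagefolder_layout("imagenet/train")` and `check_imagefolder_layout("imagenet/val")` should both return an empty list once the preparation is complete.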

Key decisions to consider

Which architecture to choose? Here's what Torchvision has but obviously you're not limited to that if you want to try something newer.
Which optimizer to use? For this homework we recommend SGD for simplicity.
What should the learning rate be? This is where we need to check our sources / see how others trained the model.
Should we change the learning rate while training? Our suggestion would be to use something simple: e.g. drop it 10x every 33% of training time.
When to stop training? We consciously set the bar at 60% Top 1 (on the validation set) so that you may not need to choose a very heavy model and / or train it forever.
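
The suggested schedule (drop the learning rate 10x every 33% of training time) can be sketched in plain Python as follows. The function name `step_decay_lr` is hypothetical, and the default `base_lr=0.1` is only an illustrative placeholder, not a recommended value; check how others trained your chosen architecture.

```python
def step_decay_lr(epoch, total_epochs, base_lr=0.1, drop_factor=0.1, num_stages=3):
    """Learning rate for a 0-based epoch under a step schedule.

    Training is split into num_stages equal parts, and the rate is
    multiplied by drop_factor each time a new part begins.
    """
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch must be in [0, total_epochs)")
    # Integer arithmetic avoids rounding surprises at stage boundaries.
    stage = epoch * num_stages // total_epochs
    return base_lr * drop_factor ** stage


if __name__ == "__main__":
    # Print the rate at the start of each stage of a 90-epoch run.
    for epoch in (0, 30, 60):
        print(epoch, step_decay_lr(epoch, total_epochs=90))
```

In PyTorch, a similar effect should be obtainable with `torch.optim.lr_scheduler.StepLR`, setting `gamma=0.1` and `step_size` to one third of the planned number of epochs.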

Please note

Please do not attempt to spend more than 3 days training your model on a single T4 GPU. If your estimate gives you a longer training time, pick a different approach.
You might want to prototype your work using Jupyter and then submit it using papermill.
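
One rough way to produce the training time estimate mentioned above is to measure throughput (images per second) over a few hundred iterations and extrapolate. The sketch below assumes the ImageNet 2012 training set of 1,281,167 images and ignores validation passes and data loading stalls, so treat the result as a lower bound. The function names and the throughput figure in the usage note are hypothetical.

```python
IMAGENET_TRAIN_IMAGES = 1_281_167  # size of the ImageNet 2012 training set


def estimated_training_hours(epochs, images_per_second,
                             train_images=IMAGENET_TRAIN_IMAGES):
    """Estimate wall-clock training hours from a measured throughput."""
    if images_per_second <= 0:
        raise ValueError("images_per_second must be positive")
    total_seconds = epochs * train_images / images_per_second
    return total_seconds / 3600


def fits_budget(hours, budget_days=3):
    """True if the estimate stays within the budget, given in days."""
    return hours <= budget_days * 24
```

For example, if you measure about 500 images per second (a made-up figure; measure your own), `fits_budget(estimated_training_hours(30, 500))` tells you whether a 30-epoch run stays within the 3-day limit.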

Extra credit

Create your own model architecture. You can draw your inspiration from the PyTorch Resnet github, for instance.

To turn in

Please turn in your training logs. They should clearly show that you have reached the target Top 1 accuracy. Also, please save / download the trained weights to your Jetson device for evaluation later.
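
For reference, Top 1 accuracy is the fraction of validation samples for which the highest-scoring class matches the true label. The following is a minimal framework-free sketch of that definition; the function name `top1_accuracy` is hypothetical, and in practice you would compute the same quantity on batched tensors inside your validation loop.

```python
def top1_accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class equals the label.

    scores: one list of per-class scores per sample.
    labels: the true class index for each sample.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the largest score is the predicted class.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(labels)
```

A log line that reports this value per epoch on the full validation set is enough to show whether the 60% target was met.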

Sean-Koval / w251_hw5