swethmandava / scaleDL_ghc19

Notebooks used during Grace Hopper 2019 workshop on scaling a Deep Learning model. Explores techniques to drastically improve performance while maintaining convergence accuracy.

Grace Hopper 2019 Workshop Material

DS717: Prototype to Production: How to Scale your Deep Learning Model

Introduction

This repository provides the contents of a workshop given at Grace Hopper 2019.

With increasingly complex Deep Learning models and datasets, AI practitioners face escalating training times and, as a result, lower productivity. In this interactive, hands-on workshop, we will scale a prototype to production quality in 60 minutes. Starting with NCF, a popular recommender system, we will explore techniques that drastically improve performance and reduce training time by roughly 20x.
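Much of that speedup on a V100 typically comes from mixed-precision training (Micikevicius et al., listed in the references), which runs most of the math in FP16 on the GPU's Tensor Cores. Below is a minimal sketch using PyTorch's torch.cuda.amp API; it illustrates the technique and is not the workshop notebook's exact code (the notebook targets NCF and may use NVIDIA Apex instead):

    import torch

    model = torch.nn.Linear(1024, 1024).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    scaler = torch.cuda.amp.GradScaler()  # scales the loss so FP16 gradients don't underflow

    for _ in range(10):
        x = torch.randn(256, 1024, device="cuda")
        target = torch.randn(256, 1024, device="cuda")
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():    # forward pass runs in mixed precision
            loss = torch.nn.functional.mse_loss(model(x), target)
        scaler.scale(loss).backward()      # backward pass on the scaled loss
        scaler.step(optimizer)             # unscales gradients, then takes the optimizer step
        scaler.update()                    # adjusts the loss scale for the next iteration

On Tensor Core GPUs such as the V100, keeping layer and batch dimensions as multiples of 8 generally helps the FP16 kernels run at full speed.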

These examples focus on scaling performance while keeping convergence consistent for a sample Deep Learning model, using a single V100 16 GB GPU.
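Keeping convergence consistent matters because the throughput tricks usually involve much larger batch sizes, which change the optimization dynamics. One common remedy, covered in the references below (Goyal et al.), is the linear learning-rate scaling rule with a warmup period. A minimal sketch follows, with illustrative hyperparameter values rather than the workshop's actual NCF settings:

    import torch

    def scaled_lr(step, base_lr, base_batch, batch, warmup_steps):
        """Linear scaling rule (Goyal et al., arXiv:1706.02677): grow the learning
        rate proportionally with the batch size, and ramp it up linearly over a
        warmup period so the larger steps don't destabilize early training."""
        peak_lr = base_lr * batch / base_batch
        if step < warmup_steps:
            return peak_lr * (step + 1) / warmup_steps
        return peak_lr

    # Usage sketch: update the optimizer's learning rate every step.
    model = torch.nn.Linear(16, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    for step in range(100):
        lr = scaled_lr(step, base_lr=0.1, base_batch=256, batch=2048, warmup_steps=20)
        for group in optimizer.param_groups:
            group["lr"] = lr
        # ... forward, backward, and optimizer.step() go here ...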

Refer to the Slides for a brief overview.

Quick Start Guide

To run the Jupyter notebook with the default parameters of the NCF model, perform the following steps.

  1. Clone the repository.
git clone https://github.com/swethmandava/scaleDL_ghc19.git
  2. Build the PyTorch NGC container.
bash scripts/docker/build.sh
  3. Download and preprocess the dataset.

This repository provides scripts to download, verify, and extract the ML-20m dataset.

To download, verify, and extract the required datasets, run:

bash scripts/data/e2e_dataset.sh

The script launches a Docker container with the current directory mounted and downloads the datasets to a data/ folder on the host.

  4. Start an interactive session in the NGC container to run the hands-on workshop.

After you build the container image and download the data, you can start an interactive CLI session as follows:

bash scripts/docker/launch.sh

In your web browser, open the Jupyter notebook by following the instructions printed in your terminal. For example, go to 127.0.0.1:8888 and enter the given token. Select ncf.ipynb and run each cell.

Release Notes

  • This repository is meant as a learning tool for understanding various computational and convergence tricks used to scale your deep learning model; a sketch of one such trick, the LARS optimizer, follows this list. Refer to NCF PyTorch to achieve state-of-the-art accuracy and performance.

  • This repository is not maintained. For the most up-to-date Deep Learning models achieving the best performance and convergence, check out NVIDIA's Deep Learning Examples.
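One of the convergence tricks referenced below is layer-wise adaptive rate scaling (LARS; You et al.), which keeps very large-batch training stable by rescaling each layer's update with its own trust ratio. The following is a minimal teaching sketch of the update rule, not the LARS Implementation linked in the references:

    import torch

    def lars_step(params, lr, momentum_buffers, eta=1e-3, weight_decay=1e-4, momentum=0.9):
        """One LARS update (You et al., arXiv:1708.03888): each layer's step is
        rescaled by a trust ratio ||w|| / (||g|| + wd * ||w||), so layers with
        small weights are not overwhelmed by large global learning rates."""
        for p, buf in zip(params, momentum_buffers):
            if p.grad is None:
                continue
            g = p.grad
            w_norm = p.detach().norm()
            g_norm = g.norm()
            if w_norm > 0 and g_norm > 0:
                trust_ratio = eta * w_norm / (g_norm + weight_decay * w_norm)
            else:
                trust_ratio = torch.tensor(1.0)
            update = g + weight_decay * p.detach()
            buf.mul_(momentum).add_(update, alpha=float(lr * trust_ratio))
            p.data.add_(buf, alpha=-1.0)

    # Usage sketch: keep one momentum buffer per parameter.
    model = torch.nn.Linear(64, 8)
    buffers = [torch.zeros_like(p) for p in model.parameters()]
    loss = model(torch.randn(32, 64)).pow(2).mean()
    loss.backward()
    lars_step(list(model.parameters()), lr=0.1, momentum_buffers=buffers)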

References

  • P. Micikevicius, S. Narang, J. Alben, G. F. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu. Mixed Precision Training. CoRR, abs/1710.03740, 2017.
  • Yang You, Igor Gitman, Boris Ginsburg. Large Batch Training of Convolutional Networks. arXiv:1708.03888
  • Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, Quoc V. Le. Don't Decay the Learning Rate, Increase the Batch Size. arXiv:1711.00489
  • Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv preprint arXiv:1706.02677, 2017.
  • LARS Implementation
