Shu244 / clouDL

Automate hyperparameter tuning on Google Cloud Platform

Overview

This package allows you to easily train and hyperparameter-tune your model on a cluster of GPU-equipped VMs in GCP. It also lets you visualize the automatically generated reports.

Setting up

This package was developed on an Ubuntu 20.04 machine.

First, download the Google Cloud SDK following the steps here and enable the REST API following the steps here. Next, install this package with pip install clouDL.

Lastly, to configure your VM cluster, run clouDL_create -f PATH. This will create a folder containing the necessary configuration files in PATH.

Some typical next steps (all associated files are in the newly created folder):

  1. Create a new access_token file to clone private repos from a VM
  2. Update user_startup.sh to specify the bash script you want to execute once a VM is ready
  3. Update hyperparameters.json to search the desired hyperparameter space
  4. Create a data.tar.zip if you have training, validation, and testing data to move to the cloud
  5. Update quick_start.sh if you want to change the number of workers, the GCP project_id, or the top N models to archive. quick_start.sh sets sensible defaults to get you up and running quickly, but finer control is available through the clouDL entry point directly; run clouDL -h for more.

Make sure to incorporate the Manager class into your code when training.

Using Manager

Manager is essentially the interface you interact with to enable training and hyperparameter tuning on a cluster of VMs.

Remember to do the following when using Manager:

  1. Set the compare and goal keys for Manager using Manager.set_compare_goal(compare, goal). This will allow Manager to compute the "best" params
  2. Start the epoch loop at the value returned by the Manager.start_epochs() method
  3. Make sure to call Manager.finished(param_dict) once the model is done training
  4. Use Manager.save_progress(param_dict, best_param_dict) sparingly since it is expensive
  5. Manager.save_progress() can be used to track the current params and the best params
  6. Use Manager.add_progress(key, value) freely
  7. Use Manager.track_model(model) to automatically track the best params and load params when training is interrupted

Progress should be saved at the end of an epoch instead of the beginning. This is not mandatory but prevents unnecessary saving.

The key "epochs" should be used and managed via Manager.add_progress(key, value). If not used, an approximate start epoch will be calculated when resuming training, which relies on existing progress and epochs starting at 0.

Manager.py Example

For a complete example, visit here.
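The condensed sketch below shows how the pieces above might fit into a training loop. The Manager method names are the ones listed in this README, but the import path, constructor, and every helper marked hypothetical are assumptions rather than the package's confirmed API.

    # Illustrative sketch only; see the complete example linked above.
    from clouDL.manager import Manager  # import path is an assumption

    manager = Manager()  # constructor arguments unknown; assumed empty
    manager.set_compare_goal("val_acc", "max")  # rank runs by validation accuracy

    model = build_model()       # hypothetical helper
    manager.track_model(model)  # reload best params if training was interrupted

    best_param_dict = None
    for epoch in range(manager.start_epochs(), 50):  # resume instead of starting at 0
        train_one_epoch(model)     # hypothetical helper
        val_acc = evaluate(model)  # hypothetical helper

        # Cheap bookkeeping; safe to call freely.
        manager.add_progress("epochs", epoch + 1)
        manager.add_progress("val_acc", val_acc)

        # Expensive; call sparingly, at the end of the epoch.
        # The shape of these dicts is assumed.
        param_dict = {"val_acc": val_acc}
        if best_param_dict is None or val_acc > best_param_dict["val_acc"]:
            best_param_dict = param_dict
        manager.save_progress(param_dict, best_param_dict)

    manager.finished(param_dict)  # run complete; the VM can shut down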

New Start

From clouDL_create, a quick_start.sh file is provided with four modes. The new mode does the following:

  1. Move your archived and compressed training data, access token (for VMs to access private repos), and hyperparameter configs to cloud storage
  2. Spin up a cluster of VMs, each with the hardware specified in configs.json.
  3. Run training on the VMs, which manages four things: progress, results, best models, and errors.
  4. Once finished, the VMs shut down automatically

Finished Job

The analyze mode does the following:

  1. View any errors
  2. Plot the results grouped by hyperparameter sections
  3. Plot the progress for the best model in this iteration
  4. Archive results and maintain an overall top N models
  5. Plot archive data to view top N best models
  6. Plot metadata to track the trend of your hyperparameter search

Resume Hyperparameter Search

Edit your hyperparameters.json file to assign each VM a new portion of the hyperparameter grid. The available search options for a given hyperparameter are:

  1. Uniform random search
  2. Step search
  3. Multiple/exponential search
  4. Predetermined list

Then use the resume mode from quick_start.sh to update the hyperparameter JSON and spin up a new cluster.
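Purely as an illustration of the four option types above, a search space might be organized along these lines. The real schema of hyperparameters.json is not documented in this README, so every key below is a hypothetical stand-in, written as a Python dict for readability:

    # Hypothetical search-space sketch; key names are assumptions, not
    # clouDL's actual schema.
    search_space = {
        "learning_rate": {"search": "uniform", "min": 1e-4, "max": 1e-1},          # uniform random
        "hidden_units": {"search": "step", "start": 64, "stop": 512, "step": 64},  # step
        "batch_size": {"search": "multiple", "start": 16, "factor": 2, "count": 4},  # multiple/exponential
        "optimizer": {"search": "list", "values": ["sgd", "adam"]},                # predetermined list
    }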

Manual Tests

Execute bash quick_start.sh manual project_id bucket_name to move everything to cloud storage without spinning up a cluster. This is helpful for running manual tests on the cloud.

Extras

An early-stopping module is also provided to reduce boilerplate training code. The module can be accessed using

from clouDL.earlystop import EarlyStopping
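The EarlyStopping interface itself is not documented here, so the sketch below shows only the generic pattern such a module replaces: stop training once a monitored metric has not improved for a set number of epochs. All names in it (patience, evaluate) are hypothetical.

    # Generic early-stopping pattern (not clouDL's confirmed API): stop once
    # the validation loss fails to improve for `patience` consecutive epochs.
    best_loss = float("inf")
    stale_epochs = 0
    patience = 5  # hypothetical choice

    for epoch in range(100):
        val_loss = evaluate(model)  # hypothetical helper
        if val_loss < best_loss:
            best_loss = val_loss
            stale_epochs = 0
        else:
            stale_epochs += 1
        if stale_epochs >= patience:
            break  # no improvement for `patience` epochs; stop early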
