Shu244 / clouDL

Automate hyperparameter tuning on Google Cloud Platform

Overview

This package allows you to easily train and hyperparameter-tune your model on a cluster of GPU-equipped VMs in GCP. It also lets you visualize the automatically generated reports.

Setting up

This package was developed on an Ubuntu 20.04 machine.

First, download the Google Cloud SDK following the steps here and enable the REST API following the steps here. Next, install this package with pip install clouDL.

Lastly, to configure your VM cluster, run clouDL_create -f PATH. This will create a folder containing the necessary configuration files in PATH.

Some typical next steps (all associated files are in the newly created folder):

  1. Create a new access_token file to clone private repos from a VM
  2. Update user_startup.sh to specify the bash script you want to execute once a VM is ready
  3. Update hyperparameters.json to search the desired hyperparameter space
  4. Create a data.tar.zip if you have training, validation, and testing data to move to the cloud
  5. Update quick_start.sh if you want to change the number of workers, the GCP project_id, or the top N models to archive. quick_start.sh sets sensible defaults to get you up and running quickly, but finer control is available through the clouDL entry point directly; run clouDL -h for more.

Make sure to incorporate the Manager class into your code when training.

Using Manager

Manager is essentially the interface you interact with to enable training and hyperparameter tuning on a cluster of VMs.

Remember to do the following when using Manager:

  1. Set the compare and goal keys for Manager using Manager.set_compare_goal(compare, goal). This will allow Manager to compute the "best" params
  2. Start the epoch loop at the value returned by the Manager.start_epochs() method
  3. Make sure to call Manager.finished(param_dict) once the model is done training
  4. Use Manager.save_progress(param_dict, best_param_dict) sparingly since it is expensive
  5. Manager.save_progress() can be used to track the current params and the best params
  6. Use Manager.add_progress(key, value) freely
  7. Use Manager.track_model(model) to automatically track the best params and load params when training is interrupted

Progress should be saved at the end of an epoch instead of the beginning. This is not mandatory but prevents unnecessary saving.

The key "epochs" should be used and managed via Manager.add_progress(key, value). If not used, an approximate start epoch will be calculated when resuming training, which relies on existing progress and epochs starting at 0.

Manager.py Example

For a complete example, visit here.
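The condensed sketch below shows how the pieces above might fit into a training loop. The Manager method names are the ones listed in this README, but the import path, constructor, and every helper marked hypothetical are assumptions rather than the package's confirmed API.

    # Illustrative sketch only; see the complete example linked above.
    from clouDL.manager import Manager  # import path is an assumption

    manager = Manager()  # constructor arguments unknown; assumed empty
    manager.set_compare_goal("val_acc", "max")  # rank runs by validation accuracy

    model = build_model()       # hypothetical helper
    manager.track_model(model)  # reload best params if training was interrupted

    best_param_dict = None
    for epoch in range(manager.start_epochs(), 50):  # resume instead of starting at 0
        train_one_epoch(model)     # hypothetical helper
        val_acc = evaluate(model)  # hypothetical helper

        # Cheap bookkeeping; safe to call freely.
        manager.add_progress("epochs", epoch + 1)
        manager.add_progress("val_acc", val_acc)

        # Expensive; call sparingly, at the end of the epoch.
        # The shape of these dicts is assumed.
        param_dict = {"val_acc": val_acc}
        if best_param_dict is None or val_acc > best_param_dict["val_acc"]:
            best_param_dict = param_dict
        manager.save_progress(param_dict, best_param_dict)

    manager.finished(param_dict)  # run complete; the VM can shut down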

New Start

From clouDL_create, a quick_start.sh file is provided with four modes. The new mode does the following:

  1. Move your archived and compressed training data, access token (for VMs to access private repos), and hyperparameter configs to cloud storage
  2. Spin up a cluster of VMs, each with the hardware specified in configs.json.
  3. Run training on the VMs, which manages four things: progress, results, best models, and errors.
  4. Once finished, the VMs shut down automatically

Finished Job

The analyze mode does the following:

  1. View any errors
  2. Plot the results grouped by hyperparameter sections
  3. Plot the progress for the best model in this iteration
  4. Archive results and maintain an overall top N models
  5. Plot archive data to view top N best models
  6. Plot metadata to track the trend of your hyperparameter search

Resume Hyperparameter Search

Edit your hyperparameters.json file to assign each VM a new portion of the hyperparameter grid. The available search options for a given hyperparameter are:

  1. Uniform random search
  2. Step search
  3. Multiple/exponential search
  4. Predetermined list

Then use the resume mode from quick_start.sh to update the hyperparameter JSON and spin up a new cluster.
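Purely as an illustration of the four option types above, a search space might be organized along these lines. The real schema of hyperparameters.json is not documented in this README, so every key below is a hypothetical stand-in, written as a Python dict for readability:

    # Hypothetical search-space sketch; key names are assumptions, not
    # clouDL's actual schema.
    search_space = {
        "learning_rate": {"search": "uniform", "min": 1e-4, "max": 1e-1},          # uniform random
        "hidden_units": {"search": "step", "start": 64, "stop": 512, "step": 64},  # step
        "batch_size": {"search": "multiple", "start": 16, "factor": 2, "count": 4},  # multiple/exponential
        "optimizer": {"search": "list", "values": ["sgd", "adam"]},                # predetermined list
    }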

Manual Tests

Execute bash quick_start.sh manual project_id bucket_name to move everything to cloud storage without spinning up a cluster. This is helpful for running manual tests on the cloud.

Extras

An early-stopping module is also provided to reduce boilerplate training code. The module can be accessed using

from clouDL.earlystop import EarlyStopping
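The EarlyStopping interface itself is not documented here, so the sketch below shows only the generic pattern such a module replaces: stop training once a monitored metric has not improved for a set number of epochs. All names in it (patience, evaluate) are hypothetical.

    # Generic early-stopping pattern (not clouDL's confirmed API): stop once
    # the validation loss fails to improve for `patience` consecutive epochs.
    best_loss = float("inf")
    stale_epochs = 0
    patience = 5  # hypothetical choice

    for epoch in range(100):
        val_loss = evaluate(model)  # hypothetical helper
        if val_loss < best_loss:
            best_loss = val_loss
            stale_epochs = 0
        else:
            stale_epochs += 1
        if stale_epochs >= patience:
            break  # no improvement for `patience` epochs; stop early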
