aws-samples / sagemaker-horovod-distributed-training

Distributed training with SageMaker's script mode using Horovod distributed deep learning framework

Sagemaker Distributed Training with Parameter Server and Horovod

Distributed training with SageMaker's script mode using Parameter Server and Horovod

This lab demonstrates two concepts using the simple MNIST dataset and the TensorFlow deep learning framework:

  • SageMaker distributed training with Parameter Server
  • SageMaker distributed training with Horovod

Both labs use SageMaker's "script mode", which allows a model's training script to be used as the entry point to the SageMaker API. Thus, one can bring existing model scripts into SageMaker without making any modifications to them.

A. SageMaker's Distributed Training Framework based on Parameter Servers

Parameter Server mode allows one to distribute the training workload across multiple compute instances by running a parameter server on each compute instance to average the model's gradient updates and distribute the updated model weights. SageMaker implements an asynchronous parameter server mode in which the gradients and model weights of each worker are updated individually, without waiting for the other workers to complete their SGD step. One parameter server is instantiated on each compute instance. For example, for train_instance_count = N, you will have N instances in total, N parameter servers, and N-1 workers (one of the instances becomes a master and does not carry out SGD computations). Currently, there is no provision for instantiating multiple workers on a compute instance, so there is limited advantage in selecting compute instances with a large number of compute cores. However, some deep learning frameworks (e.g. TensorFlow) can spread their workload across multiple cores within the same compute instance, in which case selecting a CPU instance with 8+ cores is advisable. Please note that the parameter server is implemented on CPU cores, while workers can run on either CPUs or GPUs.
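The instance layout described above can be written down as a small sketch. The helper `parameter_server_layout` is hypothetical (it is not part of the SageMaker SDK) and only restates the counts given in this section. The `distributions` dictionary follows the naming used by version 1 of the SageMaker Python SDK, which is an assumption about the SDK version used in the lab. The estimator call is left as a comment because it needs AWS access, and the script name and instance type in it are placeholders.

```python
def parameter_server_layout(train_instance_count):
    """Restate the layout described above for N training instances."""
    if train_instance_count < 2:
        raise ValueError("distributed training needs at least 2 instances")
    return {
        "instances": train_instance_count,
        "parameter_servers": train_instance_count,  # one per instance
        "workers": train_instance_count - 1,        # one instance acts as master
    }


# Argument that turns on parameter server mode in script mode
# (naming as in version 1 of the SageMaker Python SDK; an assumption here).
PARAMETER_SERVER_DISTRIBUTIONS = {"parameter_server": {"enabled": True}}

layout = parameter_server_layout(4)
print(layout)

# Hypothetical estimator call (other required arguments omitted):
# from sagemaker.tensorflow import TensorFlow
# estimator = TensorFlow(
#     entry_point="mnist.py",                # placeholder script name
#     role=role,                             # ARN of the SageMaker IAM role
#     train_instance_count=layout["instances"],
#     train_instance_type="ml.c5.2xlarge",   # placeholder CPU instance type
#     script_mode=True,
#     distributions=PARAMETER_SERVER_DISTRIBUTIONS,
# )
```

Keeping the layout in a helper makes it easy to see how many workers actually perform SGD steps before choosing an instance count.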

B. SageMaker's Horovod Distributed Training Framework

What is Horovod? It is a framework that allows a user to distribute a deep learning workload among multiple compute nodes and take advantage of the inherent parallelism of the deep learning training process. It is available for both CPU and GPU AWS compute instances. Horovod follows the Message Passing Interface (MPI) model in All-Reduce fashion. MPI is a popular standard for passing messages and managing communication between nodes in a high-performance distributed computing environment.
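To illustrate what an All-Reduce step achieves, here is a minimal pure-Python sketch. It only models the outcome (every worker ends up with the same averaged gradient) and not the communication pattern that Horovod and MPI use between nodes. The function name `allreduce_average` and the sample gradients are made up for illustration.

```python
def allreduce_average(worker_gradients):
    """Average gradients element-wise and hand every worker the same result."""
    if not worker_gradients:
        raise ValueError("at least one worker is required")
    length = len(worker_gradients[0])
    if any(len(grad) != length for grad in worker_gradients):
        raise ValueError("all workers must hold gradients of the same length")

    num_workers = len(worker_gradients)
    # Reduce: sum the i-th component across all workers.
    summed = [sum(grad[i] for grad in worker_gradients) for i in range(length)]
    averaged = [value / num_workers for value in summed]
    # Broadcast: every worker receives its own copy of the averaged gradient.
    return [list(averaged) for _ in range(num_workers)]


# Three workers, each holding a local gradient for a two-parameter model.
local_gradients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
synced = allreduce_average(local_gradients)
print(synced)
```

Unlike the asynchronous parameter server mode, this style of update is synchronous: no worker proceeds until all gradients have been combined.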

C. Further improving performance by co-locating compute instances.

To reduce network delays between compute nodes, it is advisable to locate them within the same subnet. We will demonstrate how to deploy subnets within the same VPC and assign compute instances to them.
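As a rough sketch of how compute instances might be pinned to a subnet, the helper below builds the `subnets` and `security_group_ids` keyword arguments that SageMaker estimators accept. The helper name `network_kwargs` and the IDs are placeholders, not values from this lab; the real IDs would come from the VPC deployed during the lab.

```python
import re

# Loose format checks for AWS resource IDs (illustrative, not exhaustive).
SUBNET_ID = re.compile(r"subnet-[0-9a-f]+")
SECURITY_GROUP_ID = re.compile(r"sg-[0-9a-f]+")


def network_kwargs(subnet_ids, security_group_ids):
    """Build estimator keyword arguments that place training in given subnets."""
    for subnet_id in subnet_ids:
        if not SUBNET_ID.fullmatch(subnet_id):
            raise ValueError(f"not a subnet ID: {subnet_id!r}")
    for group_id in security_group_ids:
        if not SECURITY_GROUP_ID.fullmatch(group_id):
            raise ValueError(f"not a security group ID: {group_id!r}")
    return {
        "subnets": list(subnet_ids),
        "security_group_ids": list(security_group_ids),
    }


# Placeholder IDs: a single subnet keeps all training instances together.
kwargs = network_kwargs(["subnet-0123456789abcdef0"], ["sg-0123456789abcdef0"])
print(kwargs)

# Hypothetical usage: TensorFlow(..., **kwargs)
```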

In this lab, we will be instantiating CPU compute nodes for simplicity and scalability.

Steps for launching Jupyter Notebook:

Select one of the following AWS Regions in your AWS console:

Navigate to SageMaker Service

Open the SageMaker console by clicking on "Services" and searching for SageMaker.

Navigate to SageMaker Notebooks

Create a SageMaker Notebook Instance

Give the SageMaker Notebook Instance a name (note that underscores, '_', are not allowed).

From the drop-down menu, select a compute instance type for your Jupyter Notebook. We recommend 'ml.t2.medium' for lowest cost.
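The naming rule above can be checked before opening the console. The sketch below assumes the constraints are: letters, digits, and hyphens only, no leading or trailing hyphen, and at most 63 characters. Treat the exact pattern and length limit as assumptions, and the function name `is_valid_notebook_name` as hypothetical.

```python
import re

# Assumed naming rule: alphanumerics separated by hyphens, no underscores.
NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9])*")
MAX_NAME_LENGTH = 63  # assumed upper bound on name length


def is_valid_notebook_name(name):
    """Return True if the name satisfies the assumed naming rule."""
    if len(name) > MAX_NAME_LENGTH:
        return False
    return NAME_PATTERN.fullmatch(name) is not None


for candidate in ["horovod-lab", "horovod_lab", "-horovod"]:
    print(candidate, is_valid_notebook_name(candidate))
```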

In the 'Permissions and encryption' pane, click on "Create a new role".

Select "Any S3 bucket" and click on "Create role"

You will see a newly created IAM SageMaker role

We now need to add a few more security policies to our newly created IAM SageMaker role.

Click on newly created IAM SageMaker role

Click on "Attach Policies" button

Search for "EC2Container" and add the AmazonEC2ContainerRegistryFullAccess policy (click on the radio button to the left)

Search for "VPC" and add AmazonVPCAccess policy (click on the radio button to the left)

Click on "Attach Policies" button. Your policy list for the SageMaker IAM role should look like this:

We need a custom policy to allow full access to the CloudFormation service.

Click on "Add inline policy", select the JSON tab, and paste the following:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "cloudformation:*",
            "Resource": "*"
        }
    ]
}

Give your policy a name and click on "Create Policy"
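For readers who prefer scripting over the console, the same inline policy could be attached with boto3. This is a sketch only: the function `attach_inline_policy` is hypothetical and is defined but not called here, since calling it requires AWS credentials and a real role name. The policy document itself is copied from the JSON above.

```python
import json

# Same policy document as the JSON shown above.
CLOUDFORMATION_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "cloudformation:*",
            "Resource": "*",
        }
    ],
}


def attach_inline_policy(role_name, policy_name, policy=CLOUDFORMATION_POLICY):
    """Attach the policy inline to an existing IAM role (needs AWS credentials)."""
    import boto3  # imported lazily so the rest of the sketch runs without boto3

    iam = boto3.client("iam")
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(policy),
    )


# Serialize the policy to check that it is well-formed JSON.
policy_json = json.dumps(CLOUDFORMATION_POLICY, indent=4)
print(policy_json)
```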

Your policy list for the SageMaker IAM role should look like this:

Switch to the previous browser tab with the SageMaker Notebook (you may close the browser tab with the IAM service), scroll down to "Git Repositories", click on "Repository" and select "Add Repository to Amazon SageMaker"

Select the 'GitHub repository' icon, give it a name, paste the repo's URL (https://github.com/aws-samples/sagemaker-horovod-distributed-training) and click on "Add repository"

If successful, you will see the Git repo listed under SageMaker Git Repositories:

You can now close this browser tab and go back to the previous tab where we were creating the notebook. Click on the "refresh" button in the Git Repositories pane. The GitHub repo's name should now be available in the drop-down list. Select it and click on "Create Notebook Instance" at the bottom of the page.

You will see your notebook in "Pending" status. It will take a few minutes for the Jupyter Server to start up and clone your repo. When the status changes to "InService", click on "Open Jupyter" next to the status.

You will now see the familiar Jupyter Notebook interface in a separate browser tab. Click on the file icon and you will see the repository that was checked out. Navigate into the 'notebooks' directory and open the two notebooks there:

  • tensorflow_script_mode_training_and_serving
  • tensorflow_script_mode_horovod

It may take up to a minute for the notebooks to fully load into your browser and for the kernels to start up, so please be patient.

Click 'Cell' -> 'Run All' in both notebooks. Some cells in the notebooks may take up to 15 min to fully execute. Be patient.

If prompted, select Jupyter kernel 'conda_tensorflow_p36'

Note that, depending on your choice of host machine, it may take as long as 10 min to build the container the first time.

This work is based on two SageMaker examples available in AWS SageMaker Examples directory:

License

This library is licensed under the Apache 2.0 License.
