devopsotrator/mlops-sagemaker-github-actions

Layout of the SageMaker Project Template

The template provides a starting point for bringing your SageMaker Pipeline development to production.

├── build_deployment_configs.py
├── CONTRIBUTING.md
├── deploy_stack.py
├── endpoint-config-template.yml
├── img
│   └── pipeline-full.png
├── LICENSE
├── pipelines
│   ├── abalone
│   │   ├── evaluate.py
│   │   ├── __init__.py
│   │   ├── pipeline.py
│   │   └── preprocess.py
│   ├── get_pipeline_definition.py
│   ├── __init__.py
│   ├── run_pipeline.py
│   ├── _utils.py
│   └── __version__.py
├── prod-config.json
├── README.md
├── sagemaker-pipelines-project.ipynb
├── setup.cfg
├── setup.py
├── staging-config.json
├── tests
│   ├── test_endpoints.py
│   └── test_pipelines.py
└── tox.ini

Start here

This is a sample code repository that demonstrates how you can organize your code for an ML business solution. This code repository is created as part of creating a Project in SageMaker.

In this example, we are solving the abalone age prediction problem using the abalone dataset (see below for more on the dataset). The following section provides an overview of how the code is organized and what you need to modify. In particular, pipelines/abalone/pipeline.py contains the core of the business logic for this problem. It has the code to express the ML steps involved in generating an ML model. You will also find the code that supports the preprocessing and evaluation steps in the preprocess.py and evaluate.py files respectively.

Once you understand the code structure described below, you can inspect the code and start customizing it for your own business case. This is only sample code, and you own this repository for your business use case. Go ahead and modify the files, commit them, and see the changes kick off the SageMaker pipelines in the CI/CD system.

You can also use the sagemaker-pipelines-project.ipynb notebook to experiment from SageMaker Studio before you are ready to check in your code.

A description of some of the artifacts is provided below:

Your codebuild execution instructions. This file contains the instructions needed to kick off an execution of the SageMaker Pipeline in the CI/CD system (via CodePipeline). You will see that this file has the fields defined for naming the Pipeline, ModelPackageGroup, etc. You can customize them as required.

.github/workflows/build.yml

Your pipeline artifacts, which include a pipeline module defining the required get_pipeline method that returns an instance of a SageMaker pipeline, a preprocessing script that is used in feature engineering, and a model evaluation script to measure the Mean Squared Error of the model that's trained by the pipeline. This is the core business logic. If you want to create your own folder, you can do so and implement the get_pipeline interface as illustrated here.

|-- pipelines
|   |-- abalone
|   |   |-- evaluate.py
|   |   |-- __init__.py
|   |   |-- pipeline.py
|   |   `-- preprocess.py
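As a rough illustration of what the evaluation step measures, the sketch below computes the Mean Squared Error for a handful of predictions. It is not the code in evaluate.py: the function names and the layout of the report dictionary are assumptions made for this example.

```python
import json


def mean_squared_error(y_true, y_pred):
    """Return the mean of the squared differences between labels and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def build_report(y_true, y_pred):
    """Wrap the metric in a JSON-serializable dict (this layout is an assumption)."""
    return {"regression_metrics": {"mse": {"value": mean_squared_error(y_true, y_pred)}}}


if __name__ == "__main__":
    # Made-up ring counts and predictions, for illustration only.
    print(json.dumps(build_report([10, 8, 12], [9.0, 8.5, 13.0]), indent=2))
```

A lower value means the predicted ring counts are closer to the observed ones.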

Utility modules for getting pipeline definition jsons and running pipelines (you do not typically need to modify these):

|-- pipelines
|   |-- get_pipeline_definition.py
|   |-- __init__.py
|   |-- run_pipeline.py
|   |-- _utils.py
|   `-- __version__.py
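One plausible way a driver script could work with any folder that implements get_pipeline is to import the module by name and call the function with keyword arguments. The sketch below shows that idea using only the standard library. The load_pipeline helper, the demo_pipeline module, and the region and role arguments are hypothetical and are not taken from the utility modules listed above.

```python
import importlib
import sys
import types


def load_pipeline(module_name, **kwargs):
    """Import `module_name` and return whatever its get_pipeline(**kwargs) builds."""
    module = importlib.import_module(module_name)
    get_pipeline = getattr(module, "get_pipeline", None)
    if not callable(get_pipeline):
        raise AttributeError(f"{module_name} does not define a callable get_pipeline")
    return get_pipeline(**kwargs)


# Stand-in module so the sketch runs without the SageMaker SDK installed.
# A real pipeline module would return a SageMaker pipeline object, not a dict.
_demo = types.ModuleType("demo_pipeline")
_demo.get_pipeline = lambda region, role=None: {"region": region, "role": role}
sys.modules["demo_pipeline"] = _demo

pipeline = load_pipeline("demo_pipeline", region="us-east-1", role="hypothetical-role-arn")
print(pipeline)
```

Keeping the contract to a single function name means a new pipeline folder can be swapped in by changing only the module name passed to the driver.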

Python package artifacts:

|-- setup.cfg
|-- setup.py

A stubbed testing module for testing your pipeline as you develop:

|-- tests
|   `-- test_pipelines.py

The tox testing framework configuration:

tox.ini

Deploy your model

This is a sample code repository for demonstrating how you can organize your code for deploying a real-time inference Endpoint infrastructure. This code repository is created as part of creating a Project in SageMaker.

This code repository has the code to find the latest approved ModelPackage for the associated ModelPackageGroup and automatically deploy it to the Endpoint on detecting a change (build.py). This code repository also defines the CloudFormation template which defines the Endpoints as infrastructure. It also has configuration files associated with the staging and prod stages.

Upon triggering a deployment, the CodePipeline pipeline will deploy 2 Endpoints - staging and prod. After the first deployment is completed, the CodePipeline waits for a manual approval step for promotion to the prod stage.

You own this code. You can modify this template as you need and add additional tests for your custom validation.

A description of some of the artifacts is provided below:

The GitHub Actions workflow to build and deploy Endpoints.

.github/workflows/deploy.yml

build.py

This Python file contains code to get the latest approved package ARN and export the staging and prod configuration files. It is invoked from the Build stage.
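A minimal sketch of the "latest approved package" lookup, assuming the boto3 SageMaker client and its list_model_packages call. This is not the code in build.py: the helper name and the FakeSageMakerClient stand-in are hypothetical, and the fake exists only so the example runs without AWS credentials.

```python
def latest_approved_model_package_arn(sm_client, model_package_group_name):
    """Return the ARN of the most recently created Approved package in the group."""
    response = sm_client.list_model_packages(
        ModelPackageGroupName=model_package_group_name,
        ModelApprovalStatus="Approved",
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    summaries = response["ModelPackageSummaryList"]
    if not summaries:
        raise LookupError(f"No approved model package in {model_package_group_name}")
    return summaries[0]["ModelPackageArn"]


class FakeSageMakerClient:
    """Hypothetical stand-in for boto3.client("sagemaker"), used only in this demo."""

    def __init__(self, summaries):
        self._summaries = summaries

    def list_model_packages(self, **kwargs):
        # A real client filters and sorts server-side; the fake returns canned data.
        return {"ModelPackageSummaryList": self._summaries}


fake = FakeSageMakerClient([{"ModelPackageArn": "arn:aws:sagemaker:example:model-package/demo/2"}])
print(latest_approved_model_package_arn(fake, "demo-group"))
```

With real credentials, the same function could be called with boto3.client("sagemaker") in place of the fake.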

endpoint-config-template.yml

This CloudFormation template file is packaged by the build step in the CodePipeline and is deployed in different stages.

staging-config.json

This configuration file is used to customize the staging stage in the pipeline. You can configure the instance type and instance count here.

prod-config.json

This configuration file is used to customize the prod stage in the pipeline. You can configure the instance type and instance count here.
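To make the role of the stage configuration files more concrete, the sketch below loads a small JSON document and fills in a model package ARN before deployment. The key names (Parameters, StageName, EndpointInstanceType, EndpointInstanceCount, ModelPackageName) and the helper are assumptions for illustration; check the actual JSON files in the repository for the real field names.

```python
import json


def with_model_package(stage_config, model_package_arn):
    """Return a copy of the stage config with the model package ARN filled in."""
    params = dict(stage_config.get("Parameters", {}))
    params["ModelPackageName"] = model_package_arn  # assumed parameter name
    return {**stage_config, "Parameters": params}


# Hypothetical config content; the real staging-config.json may differ.
staging = json.loads("""
{
  "Parameters": {
    "StageName": "staging",
    "EndpointInstanceType": "ml.m5.large",
    "EndpointInstanceCount": "1"
  }
}
""")

export = with_model_package(staging, "arn:aws:sagemaker:example:model-package/demo/2")
print(json.dumps(export, indent=2))
```

Returning a copy rather than mutating the input keeps the checked-in configuration separate from the per-deployment values.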

test\test.py

This Python file contains code to describe and invoke the staging endpoint. You can customize it to add more tests.
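A hedged sketch of what "describe and invoke" can look like with boto3, assuming the describe_endpoint call on the SageMaker client and the invoke_endpoint call on the SageMaker runtime client. The helper names, the endpoint name, the feature values, and the two fake clients are hypothetical. The fakes only let the example run offline.

```python
import io


def endpoint_is_in_service(sm_client, endpoint_name):
    """Return True when the endpoint reports the InService status."""
    description = sm_client.describe_endpoint(EndpointName=endpoint_name)
    return description["EndpointStatus"] == "InService"


def invoke_csv(runtime_client, endpoint_name, features):
    """Send one CSV row to the endpoint and return the response body as text."""
    body = ",".join(str(value) for value in features)
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=body,
    )
    return response["Body"].read().decode("utf-8")


class FakeSageMakerClient:
    """Hypothetical stand-in for boto3.client("sagemaker")."""

    def describe_endpoint(self, EndpointName):
        return {"EndpointName": EndpointName, "EndpointStatus": "InService"}


class FakeRuntimeClient:
    """Hypothetical stand-in for boto3.client("sagemaker-runtime")."""

    def invoke_endpoint(self, EndpointName, ContentType, Body):
        # Always answers with a canned prediction; a real endpoint runs the model.
        return {"Body": io.BytesIO(b"9.5")}


name = "demo-staging"  # hypothetical endpoint name
assert endpoint_is_in_service(FakeSageMakerClient(), name)
print(invoke_csv(FakeRuntimeClient(), name, [0.455, 0.365, 0.095]))
```

Passing the clients in as arguments makes it straightforward to add further checks without calling AWS in unit tests.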

Dataset for the Example Abalone Pipeline

The dataset used is the UCI Machine Learning Abalone Dataset [1]. The aim for this task is to determine the age of an abalone (a kind of shellfish) from its physical measurements. At the core, it's a regression problem.

The dataset contains several features - length (longest shell measurement), diameter (diameter perpendicular to length), height (height with meat in the shell), whole_weight (weight of whole abalone), shucked_weight (weight of meat), viscera_weight (gut weight after bleeding), shell_weight (weight after being dried), sex ('M', 'F', 'I' where 'I' is Infant), as well as rings (integer).

The number of rings turns out to be a good approximation for age (age is rings + 1.5). However, obtaining this number requires cutting the shell through the cone, staining the section, and counting the rings through a microscope, which is a time-consuming task. The other physical measurements are easier to determine, so we use the dataset to build a predictive model of the variable rings from these measurements.
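The relationship between rings and age can be written as a one-line function. The sketch below also parses a single CSV row into the named features. The column order (sex first, rings last) and the sample row are assumptions for illustration, and the helper names are hypothetical.

```python
import csv
import io

# Assumed column order for a header-less CSV row.
COLUMNS = [
    "sex", "length", "diameter", "height", "whole_weight",
    "shucked_weight", "viscera_weight", "shell_weight", "rings",
]


def rings_to_age(rings):
    """Approximate age in years from the ring count (age is rings + 1.5)."""
    return rings + 1.5


def parse_row(line):
    """Turn one CSV line into a dict with numeric measurements and an int ring count."""
    values = next(csv.reader(io.StringIO(line)))
    record = dict(zip(COLUMNS, values))
    for name in COLUMNS[1:-1]:
        record[name] = float(record[name])
    record["rings"] = int(record["rings"])
    return record


row = parse_row("M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15")  # illustrative row
print(row["sex"], row["rings"], rings_to_age(row["rings"]))
```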

We'll upload the data to a bucket we own. But first we gather some constants we can use later throughout the notebook.

[1] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
