Introduction
This walkthrough illustrates how to use the OctoML model-optimization platform with the Azure CLI to deploy a model to an Azure ML Managed Endpoint in about 30 minutes.
The OctoML platform automatically produces a downloadable container with your model that is accelerated, configured, and ready to deploy on the hardware target of your choice. You can learn more here.
Prerequisites
- The Azure CLI (v2) w/ the Azure ML Extension
- OctoML Platform Access
- The OctoML SDK (platform access token required)
- Download the OctoML CLI
NOTE: While the CLI is publicly available to try, an access token is required to get the full benefits of OctoML's model-optimization stack. Learn more about the CLI here.
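Before starting, make sure the Azure ML extension is installed and you are logged in to the correct subscription. A minimal setup check (the subscription ID is a placeholder you should replace):

```bash
# Install the Azure ML extension for the Azure CLI (v2)
az extension add --name ml

# Authenticate and select the subscription that holds your workspace
az login
az account set --subscription <your-subscription-id>

# Confirm the ML extension is available
az ml -h
```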
Deploy an Optimized Model to an Azure ML Managed Endpoint
1. Upload and Optimize a Model
From the Explore UI page, upload a model in ONNX, PyTorch, TensorFlow, or scikit-learn format. Click the Explore button to continue to the model optimization workflow settings page.
2. Explore Cloud Cost Savings
- On the next page, select the hardware targets for which you would like to explore potential performance and cost savings.
- Click the blue Explore button. In the background, the OctoML ML Compiler stack will run a sweep of optimization strategies and benchmark the results on real, production hardware.
- Return to the Explore UI to see the results. Typically the "best" model package is the one with the lowest overall cost profile and the highest requests-per-second. However, we've designed the constraints UI to optimize around other SLA considerations, such as latency and throughput.
3. Package a Model for Production Serving
- Click on one of the benchmarking results (plot points) to expand the packaging UI.
- Download the docker tarball artifact to the same directory as the `azureml_deploy.sh` script. (An optional local sanity check on the downloaded image is sketched below.)
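Before deploying, you can optionally confirm that the downloaded artifact loads as a Docker image. This is a sketch assuming the tarball is a standard Docker image archive; the filename below is the example used later in this guide, and yours will differ:

```bash
# Load the OctoML-packaged image into the local Docker daemon
# (docker load accepts gzip-compressed archives directly)
docker load -i resnet50_v2_7.tar.gz

# Verify the image is now available locally
docker images
```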
4. Deploy a Model Using the Azure CLI w/ the ML Extension
Provided as a shortcut, the `azureml_deploy.sh` script will create the AML managed endpoint and model deployment. The script leans heavily on the Azure CLI + ML extension to orchestrate resource and service creation and management.
- Update the environment variables at the top of the file to reflect your Azure / AML resource names.
```bash
# Set parameters
# User Azure Workspace parameters
RESOURCE_GRP=<your-resource-group-name>
WORKSPACE_NAME=<your-workspace-name>

# Model parameters
MODEL_NAME=<your-model-name>                  # e.g. resnet50_v2_7
TARBALL_NAME=<octomized-model-tarball-name>   # e.g. resnet50_v2_7.tar.gz

# Azure Container Registry parameters
ACR_NAME=<your-acr-registry-name>             # e.g. octoml_gtm

# Azure ML Managed Endpoint parameters
ENDPOINT_NAME=<your-endpoint-name>            # e.g. my-octoml-service-endpoint
DEPLOYMENT_NAME=<your-deployment-name>        # e.g. octoml-model-dep-1
INSTANCE_TYPE=<your-endpoints-instance-type>  # e.g. Standard_DS4_v2
```
- From the command line, run `bash azureml_deploy.sh`. This process will take approximately 20 minutes to complete. (A sketch of the kinds of commands the script orchestrates follows this list.)
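For reference, here is a sketch of the kinds of Azure CLI + ML extension commands such a script typically orchestrates. The actual `azureml_deploy.sh` may differ in details; in particular, the image tag and the `deployment.yml` filename below are assumptions for this sketch:

```bash
# Push the OctoML image to your Azure Container Registry
az acr login --name $ACR_NAME
docker load -i $TARBALL_NAME
# Assumes the loaded image is tagged $MODEL_NAME; adjust to the tag reported by docker load
docker tag $MODEL_NAME $ACR_NAME.azurecr.io/$MODEL_NAME:latest
docker push $ACR_NAME.azurecr.io/$MODEL_NAME:latest

# Create the managed online endpoint
az ml online-endpoint create --name $ENDPOINT_NAME \
    --resource-group $RESOURCE_GRP --workspace-name $WORKSPACE_NAME

# Create the deployment from a YAML spec that references the pushed image,
# then route all traffic to it
az ml online-deployment create --name $DEPLOYMENT_NAME \
    --endpoint $ENDPOINT_NAME --file deployment.yml \
    --resource-group $RESOURCE_GRP --workspace-name $WORKSPACE_NAME \
    --all-traffic
```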
5. Test the Model Service
Run an inference test with `python3 run.py --remote`.
NOTE: To use `run.py` you must install the Python packages in the `environment.yaml` file, plus tritonclient:
```bash
conda env create -f environment.yaml
conda activate octo_aml
pip3 install tritonclient[all]
```
This will create a clean conda environment called `octo_aml`.
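If you want to exercise or inspect the endpoint without `run.py`, the Azure CLI can retrieve the scoring URI and credentials directly. A minimal sketch using the variables defined earlier, followed by a cleanup command for when you are finished:

```bash
# Look up the endpoint's scoring URI and access credentials
az ml online-endpoint show --name $ENDPOINT_NAME \
    --resource-group $RESOURCE_GRP --workspace-name $WORKSPACE_NAME \
    --query scoring_uri
az ml online-endpoint get-credentials --name $ENDPOINT_NAME \
    --resource-group $RESOURCE_GRP --workspace-name $WORKSPACE_NAME

# Tear down the endpoint (and its deployments) to stop incurring charges
az ml online-endpoint delete --name $ENDPOINT_NAME \
    --resource-group $RESOURCE_GRP --workspace-name $WORKSPACE_NAME --yes
```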