This is a capstone project for the Udacity Azure Machine Learning Engineer Nanodegree. It demonstrates how to use the Azure ML platform to forecast the price of a crypto pair, using Ethereum/Bitcoin as an example dataset.
We will examine two approaches: using HyperDrive to tune a manually prepared training script, and using AutoML to automate the entire process. Then we will pick the best model and deploy it as a web service to do real-time forecasting.
Table of Contents
Here is the workflow overview of the project.
Section outline:
- 0. Project set up and installation
- 1. Choose a dataset
- 2. Import dataset into the workspace
- 3. Train model using HyperDrive
- 4. Train model using AutoML
- 5. Compare model performance
- 6. Deploy best model
- 7. Test model endpoint
Below are instructions to set up an Azure ML environment to host the Jupyter Notebooks in this repository.
az login
# Azure ML CLI v1
# az extension add -n azure-cli-ml
# To use v2 instead, remove the old ones
# az extension remove -n azure-cli-ml
# az extension remove -n ml
# Add Azure ML CLI v2
az extension add -n ml
az version
# {
# "azure-cli": "2.27.0",
# "azure-cli-core": "2.27.0",
# "azure-cli-telemetry": "1.0.6",
# "extensions": {
# "ml": "2.0.1a5"
# }
# }
# create resource group
az group create -l westus -n Nanodegree
# set default current working resource group and workspace to avoid retyping
az configure --defaults group="Nanodegree" workspace="Capstone"
az ml workspace create
# create a service principal and capture its client Id (using the jq tool)
clientId=$(az ad sp create-for-rbac --sdk-auth --name ml-auth | jq -r .clientId)
# resolve the object Id of the service principal from its client Id
objectId=$(az ad sp show --id "$clientId" | jq -r .objectId)
# grant the service principal access to the workspace
az ml workspace share --user "$objectId" --role owner
az ml compute create --name capstone-notebook --size Standard_D2s_v3 --type ComputeInstance
I use over three years of historical daily price data for the ETHBTC (Ethereum/Bitcoin) crypto pair. The data is downloaded from Binance.us and uploaded to my Azure ML workspace, where it is registered as a Dataset.
It is a time series dataset that includes a time column, OHLC columns, and a volume column.
- time: the daily time period
- open: the opening price of a period
- high: the highest price in a period
- low: the lowest price in a period
- close: the closing price of a period
- volume: the total amount traded in a period
My goal is to forecast the future price of ETHBTC. Since this is time series data, the time column is a required feature. For the purpose of this project, I will forecast only the close price. In other words, this is a univariate analysis in which the other value features are not considered in the model.
The data is uploaded as a Dataset in Azure ML Studio, and it can be accessed by name using Python SDK. A copy of the time series data is stored as a CSV file in this project repository.
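As a sketch of how the univariate subset can be prepared locally, the CSV copy can be reduced to just the time and close columns. The inline sample rows below are illustrative values in the same shape as the repository CSV, not real market data:

```python
import csv
import io

# Illustrative sample in the same column layout as the repository CSV
# (time, OHLC, volume); the numbers are made up.
sample_csv = """time,open,high,low,close,volume
2021-09-06T00:00:00Z,0.0780,0.0792,0.0775,0.0789,1200.5
2021-09-07T00:00:00Z,0.0789,0.0801,0.0781,0.0795,1350.2
2021-09-08T00:00:00Z,0.0795,0.0799,0.0771,0.0774,1821.9
"""

def load_univariate(fileobj):
    """Keep only the columns needed for univariate forecasting."""
    reader = csv.DictReader(fileobj)
    return [{"time": row["time"], "close": float(row["close"])}
            for row in reader]

rows = load_univariate(io.StringIO(sample_csv))
```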
Preview access in Jupyter Notebook
Full notebook: hyperparameter_tuning.ipynb
I use the Facebook Prophet model to forecast the close price. Two hyperparameters are tuned with HyperDrive, each sampled from a discrete set of 5 options:
- `--changepoint-prior-scale`: dictates the flexibility of the automatic changepoint selection. Large values allow many changepoints; small values allow few changepoints.
- `--seasonality-prior-scale`: dictates the strength of the seasonality model. Larger values allow the model to fit larger seasonal fluctuations; smaller values dampen the seasonality.
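The discrete random sampling can be illustrated in plain Python. The candidate values below are placeholders, not the exact sets used in the notebook; the real run uses HyperDrive's `RandomParameterSampling` with `choice`:

```python
import random

# Hypothetical discrete sets of 5 options each (placeholder values).
search_space = {
    "--changepoint-prior-scale": [0.001, 0.01, 0.05, 0.1, 0.5],
    "--seasonality-prior-scale": [0.01, 0.1, 1.0, 5.0, 10.0],
}

def sample_run(space, rng=random):
    # Pick one value per hyperparameter, as RandomParameterSampling would.
    return {name: rng.choice(values) for name, values in space.items()}

args = sample_run(search_space)
```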
The early termination policy is a Bandit Policy, which is based on a slack factor/slack amount and an evaluation interval. Its purpose is to avoid burning compute resources on training runs that are unlikely to yield a better result.
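One way to express the slack-factor rule for a minimized metric is shown below. This is a simplified sketch of the idea, not Azure's exact implementation:

```python
def should_terminate(run_metric, best_metric, slack_factor):
    """Terminate a run whose RMSE has drifted more than the slack factor
    away from the best RMSE seen so far (lower is better)."""
    return run_metric > best_metric * (1 + slack_factor)

# A run at 0.020 vs. a best of 0.013 with 10% slack would be stopped;
# a run at 0.014 would be allowed to continue.
```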
The performance metric is Root Mean Squared Error (RMSE), which means the lower the better; therefore, the primary metric goal is set to `MINIMIZE`.
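For reference, the metric itself is straightforward to compute (a minimal sketch):

```python
import math

def rmse(actual, predicted):
    """Root Mean Squared Error: square root of the mean squared residual."""
    assert len(actual) == len(predicted)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                     / len(actual))
```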
The custom train environment that has all necessary dependencies is configured through the provided Dockerfile.
RMSE: 0.013
We use root mean squared error as the performance metric to compare this HyperDrive model with the AutoML model below.
Full notebook: automl.ipynb
We use the same training data, which has only the time and close price columns, to build the forecasting model. Here we configure a forecasting task for Azure AutoML and submit the experiment.
In the `ForecastingParameters`, we set the time column name to the `time` column in our training data. Because the training data is in a daily period, we set the frequency to daily, `freq='D'`. Finally, we set the forecast horizon to 20 days into the future. In practice, we neither need nor want a very long forecast because it will not be accurate in a highly volatile financial system such as crypto; as soon as we have new price data, we should retrain.
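Given `freq='D'` and a 20-day horizon, the set of forecastable dates can be enumerated directly (using the training cutoff of 2021-09-08 described in the deployment section):

```python
from datetime import date, timedelta

last_train_date = date(2021, 9, 8)   # last date in the training data
horizon = 20                         # forecast horizon in days

# Daily frequency: one forecastable date per day after the cutoff.
forecast_dates = [last_train_date + timedelta(days=i)
                  for i in range(1, horizon + 1)]
```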
The supplied training data contains only the `time` column and the target `close` price column, so that the forecast is univariate.
Regarding the evaluation metric, we pick the default for the AutoML forecasting task, `normalized_root_mean_squared_error`. Afterward, we can extract `root_mean_squared_error`, which is among the available metrics of the trained model, because we need to compare with the HyperDrive model on the same metric.
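Assuming AutoML normalizes RMSE by the range of the target (max minus min) — an assumption worth verifying against the metric documentation — converting between the two metrics is a one-liner:

```python
def denormalize_rmse(nrmse, y_min, y_max):
    """Recover RMSE from normalized RMSE, assuming range normalization."""
    return nrmse * (y_max - y_min)
```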
The rest of the AutoML configuration is not domain specific and can be changed at will, which leaves room for experimentation.
The best model is trained with a Decision Tree regression algorithm, with the data standardized by a Robust Scaler, which is based on percentiles and is therefore not influenced by a small number of very large marginal outliers.
The hyperparameters are as follows:
Robust Scaler data transformation
{
"class_name": "RobustScaler",
"module": "sklearn.preprocessing",
"param_args": [],
"param_kwargs": {
"quantile_range": [
10,
90
],
"with_centering": false,
"with_scaling": false
},
"prepared_kwargs": {},
"spec_class": "preproc"
}
Decision Tree training algorithm
{
"class_name": "DecisionTreeRegressor",
"module": "sklearn.tree",
"param_args": [],
"param_kwargs": {
"criterion": "mse",
"max_features": null,
"min_samples_leaf": 0.006781961770526707,
"min_samples_split": 0.0008991789964660114,
"splitter": "best"
},
"prepared_kwargs": {},
"spec_class": "sklearn"
}
This information can also be extracted programmatically.
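For example, the algorithm description above is plain JSON and can be parsed with the standard json module:

```python
import json

# The Decision Tree step as reported by AutoML (copied from above).
algo_spec = json.loads("""
{
    "class_name": "DecisionTreeRegressor",
    "module": "sklearn.tree",
    "param_args": [],
    "param_kwargs": {
        "criterion": "mse",
        "max_features": null,
        "min_samples_leaf": 0.006781961770526707,
        "min_samples_split": 0.0008991789964660114,
        "splitter": "best"
    },
    "prepared_kwargs": {},
    "spec_class": "sklearn"
}
""")

hyperparams = algo_spec["param_kwargs"]
```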
RMSE: 0.003
Based on the RMSE metric, the AutoML model (RMSE=0.003) has roughly four times lower error than the HyperDrive model (RMSE=0.013). This is only a rough comparison, because an exact comparison is hard to conduct due to differences between how AutoML evaluates and how the HyperDrive training script evaluates. However, a factor of four probably covers the margin of error in the comparison method and gives confidence in the AutoML model; therefore, we choose the AutoML model as the best model to deploy.
Here we go through the steps to register the model and configure the inference endpoint to deploy it as a web service on Azure Container Instances.
A sample scoring script can be downloaded from the model outputs directory.
Use the same environment as the model, and use the downloaded scoring script as the entry script.
Since this is a univariate forecast of the `close` price, the only input needed is the `time` value, and the time should be after the end of the training dataset (2021-09-08) and within the 20-day horizon. In other words, any day from 2021-09-09 to 2021-09-28 is valid. Usually, the nearer in the future, the higher the confidence.
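A request-side sanity check for that window might look like the following sketch (dates taken from the description above):

```python
from datetime import date, timedelta

LAST_TRAIN_DATE = date(2021, 9, 8)  # end of the training dataset
HORIZON_DAYS = 20                   # AutoML forecast horizon

def is_forecastable(day):
    """True if the date falls inside the 20-day horizon after training."""
    return LAST_TRAIN_DATE < day <= LAST_TRAIN_DATE + timedelta(days=HORIZON_DAYS)
```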
Example data format of the JSON request body:
{
  "data": [
    {
      "time": "2021-09-10T00:00:00.000Z"
    }
  ]
}
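A sketch of building such a request with the standard library; the scoring URI and API key below are placeholders for the values shown in the notebook:

```python
import json
import urllib.request

def build_scoring_request(scoring_uri, api_key, time_iso):
    """Prepare (but do not send) a POST request for the web service."""
    body = json.dumps({"data": [{"time": time_iso}]}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",  # key-based auth on the endpoint
    }
    return urllib.request.Request(scoring_uri, data=body, headers=headers)

req = build_scoring_request("http://placeholder.azurecontainer.io/score",
                            "placeholder-api-key",
                            "2021-09-10T00:00:00.000Z")
# Sending the request would be: urllib.request.urlopen(req)
```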
Example response, in which the forecasted close price is the value of the `forecast` property:
{
"forecast": [
0.0648978
],
"index": [
{
"time": 1631232000000,
"_automl_dummy_grain_col": "_automl_dummy_grain_col"
}
]
}
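The `time` value in the response index is a Unix epoch timestamp in milliseconds; a small helper sketch converts it back to a date to confirm it matches the requested day:

```python
from datetime import datetime, timezone

def epoch_ms_to_iso(ms):
    """Convert an epoch-milliseconds timestamp to an ISO-8601 UTC string."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).isoformat()

# 1631232000000 is the timestamp returned in the example response above.
when = epoch_ms_to_iso(1631232000000)
```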
No more web service endpoint is available after the deletion.
No more cluster is available after the deletion.
Screencast video demonstrating the live model deployed as a web service handling forecast requests.
- Apply multivariate forecasting using the other available features such as open, high, low, and volume instead of only the close price. This could improve the prediction quality. Furthermore, we can combine other time series data, such as other cryptos, stock index data, and inflation data, as they might affect the price movement of the target crypto. An algorithm I have in mind is Vector Autoregression if I write a local training script; for AutoML, I would simply not exclude the other features, because the AutoML forecasting task applies multivariate modeling by default.
- Increasing the experiment timeout and the number of cross-validation folds for AutoML can also help find a better-performing model.
- For a real financial forecasting platform, it is necessary to retrain frequently whenever new price data is available, so we can set up a pipeline that automatically downloads new price data and triggers a retrain.