
Optimizing an ML Pipeline in Azure

Overview

This project is part of the Udacity Azure ML Nanodegree. In it, we build and optimize an Azure ML pipeline using the Python SDK and a provided Scikit-learn model, and then compare this model to an Azure AutoML run.

Summary

The dataset used in this project is the UCI Bank Marketing dataset, which contains data about a marketing campaign run by a financial institution. The goal is to predict which clients are likely to subscribe to a term deposit account. This will go a long way toward improving the bank's future marketing campaigns, since it predicts how likely a client is to subscribe to the product being advertised.

The best performing model was the VotingEnsemble produced by the AutoML run, with an accuracy of 0.9171.

Scikit-learn Pipeline

The pipeline architecture, including the data, hyperparameter tuning, and classification algorithm, can be described in six steps (a minimal code sketch follows the list):

  • Data Acquisition: The data was acquired from a provided URL using TabularDatasetFactory.

  • Data Wrangling/Cleaning: The acquired data was cleaned with the clean_data method and divided into inputs (x) and target (y).

  • Train/Test Split: The data was then split into training (75%) and test (25%) sets using scikit-learn's train_test_split.

  • Classification Algorithm: Logistic Regression was chosen as the algorithm for this binary classification task.

  • Hyperparameter Tuning with HyperDrive: Hyperparameters were tuned to find the best possible values for the Logistic Regression classifier. The two hyperparameters tuned were the inverse of regularization strength (C) and the maximum number of iterations (max_iter). For C, a uniform range between 0.5 and 2.0 was used, while for max_iter four discrete values [10, 20, 25, 50] were used. The early termination policy was a BanditPolicy based on slack factor, with a slack factor of 0.1 and an evaluation interval of 2.

  • Save Best Model: The best model from the tuning run was selected and saved.
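
Below is a minimal sketch of what the core steps of the training script (train.py) might look like. The dataset URL is a placeholder and clean_data is assumed to be defined in the script; this illustrates the steps above rather than reproducing the exact code.

```python
# Sketch of the Scikit-learn pipeline described above.
import argparse
import joblib
from azureml.data.dataset_factory import TabularDatasetFactory
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 1. Data acquisition from a URL (placeholder).
ds = TabularDatasetFactory.from_delimited_files(path="https://<url-to>/bankmarketing_train.csv")

# 2. Cleaning: clean_data (defined elsewhere in train.py) returns inputs x and target y.
x, y = clean_data(ds)

# 3. 75/25 train/test split.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

# 4. Fit Logistic Regression with the hyperparameters passed in by HyperDrive.
parser = argparse.ArgumentParser()
parser.add_argument("--C", type=float, default=1.0)
parser.add_argument("--max_iter", type=int, default=100)
args = parser.parse_args()

model = LogisticRegression(C=args.C, max_iter=args.max_iter).fit(x_train, y_train)
print("Accuracy:", model.score(x_test, y_test))

# 5. Save the model so the best run can be registered later.
joblib.dump(model, "outputs/model.joblib")
```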

What are the benefits of the parameter sampler you chose? Random Parameter Sampling supports both continuous and discrete hyperparameters, and it supports early termination of low-performing runs, which saves time and computing resources.

What are the benefits of the early stopping policy you chose? Early stopping improves computational efficiency and saves time by terminating training runs that perform poorly.
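
Assuming a ScriptRunConfig named src that points at train.py (src itself is not shown here), the sampler and policy described above could be wired together roughly as follows; max_total_runs is an assumption, not a value from this README.

```python
# Sketch of the HyperDrive setup; `src` (a ScriptRunConfig for train.py)
# is assumed to be defined elsewhere in the notebook.
from azureml.train.hyperdrive import (
    BanditPolicy, HyperDriveConfig, PrimaryMetricGoal,
    RandomParameterSampling, choice, uniform,
)

# Random sampling: a continuous range for C, discrete values for max_iter.
param_sampling = RandomParameterSampling({
    "--C": uniform(0.5, 2.0),
    "--max_iter": choice(10, 20, 25, 50),
})

# Bandit policy: every 2 intervals, stop runs whose primary metric falls
# outside a 0.1 slack factor of the best run so far.
policy = BanditPolicy(slack_factor=0.1, evaluation_interval=2)

hyperdrive_config = HyperDriveConfig(
    run_config=src,  # assumed ScriptRunConfig
    hyperparameter_sampling=param_sampling,
    policy=policy,
    primary_metric_name="Accuracy",
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=20,  # assumption
)
```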

AutoML

The VotingEnsemble classifier was the best model generated by AutoML; it uses LightGBM among its constituent models and gave the best accuracy of 91.71%. AutoML also performed some exploratory data analysis and discovered that the dataset was imbalanced. The hyperparameters generated for the best model were:

```
PreFittedSoftVotingClassifier(
    l1_ratio=0.83673469387,
    learning_rate='constant',
    loss='modified_huber',
    max_iter=1000,
    n_jobs=1,
    penalty='l2',
    power_t=0.2222222222222222,
    random_state=None,
    tol=0.0001,
)
```

We can see that penalty is set to l2, l1_ratio is 0.8367, the learning rate is constant, and max_iter is set to 1000.
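
For context, an AutoML run like this one is configured with AutoMLConfig. The sketch below shows a typical setup for this task; the label column name, cross-validation count, and timeout are assumptions rather than values from this README.

```python
# Sketch of an AutoML configuration for this classification task.
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(
    task="classification",
    primary_metric="accuracy",
    training_data=ds,               # TabularDataset loaded earlier
    label_column_name="y",          # assumed target column
    n_cross_validations=5,          # assumption
    experiment_timeout_minutes=30,  # assumption
)
```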

Pipeline comparison

Comparing the two models, their architectures, and their performance:

  • AutoML was able to run and tune the hyperparameters of several algorithms (37 to be exact), while with HyperDrive only one algorithm's hyperparameters could be tuned.

  • It was easier to set up the configuration definition for AutoML than for HyperDrive.

  • AutoML was able to perform data checks and determine that the classes were not balanced. This was not possible with HyperDrive.

  • AutoML produced a better accuracy of 91.71%, while HyperDrive gave an accuracy of 90.09%. This is because they use different classifiers: Logistic Regression for HyperDrive and the VotingEnsemble classifier for AutoML.

  • Finally, AutoML is a lot slower than HyperDrive because, unlike HyperDrive, which runs only one algorithm, AutoML has to run several.

Future work

What are some areas of improvement for future experiments? Why might these improvements help the model?

AutoML indicated that the dataset was imbalanced just before training the model. To improve model performance, methods like oversampling, undersampling, and/or SMOTE could be used to balance the dataset before training. This will improve performance by ensuring that the model actually identifies the minority class correctly.
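
As one illustration, SMOTE from the third-party imbalanced-learn package (an assumption; it is not used in this repo) could be applied to the training split before fitting:

```python
# Sketch: oversample the minority class with SMOTE so both classes are
# equally represented in the training data (requires imbalanced-learn).
from collections import Counter
from imblearn.over_sampling import SMOTE

print("Before:", Counter(y_train))
x_train_res, y_train_res = SMOTE(random_state=42).fit_resample(x_train, y_train)
print("After:", Counter(y_train_res))
```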

In addition, more data cleaning could be done to improve model performance. Some cleaning strategies include (see the sketch after this list):

  • Drop columns that don't seem necessary, such as the default column.
  • Identify highly correlated columns (for example, with a correlation matrix) and drop one of each pair, or reduce dimensionality with principal component analysis. Correlated columns bring the same information to the model, so it is logical to remove one of them.
  • Check for outliers in numerical columns like age and remove them. Outliers can skew predictions either positively or negatively, which hurts model performance.
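
A minimal sketch of the last two ideas, assuming x is a pandas DataFrame that includes a numeric age column; the 0.9 correlation threshold and the 1.5 * IQR rule are common defaults, not values from this project.

```python
# Sketch: drop highly correlated columns, then remove outliers in "age".
import numpy as np

# Keep only the upper triangle of the absolute correlation matrix so each
# pair of columns is inspected once, then drop one column from each pair.
corr = x.corr(numeric_only=True).abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
x = x.drop(columns=to_drop)

# Remove rows whose "age" falls outside the 1.5 * IQR fences.
q1, q3 = x["age"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = x["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
x, y = x[mask], y[mask]
```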

Taken together, these steps should go a long way toward improving model performance.

Proof of cluster clean up

Image: compute cluster marked for deletion.
