
Optimizing an ML Pipeline in Azure

Overview

This project is part of the Udacity Azure ML Nanodegree. In this project, we build and optimize an Azure ML pipeline using the Python SDK and a provided Scikit-learn model. This model is then compared to an Azure AutoML run.

Summary

The dataset contains data about the marketing campaigns of a banking institution. The analysis is about finding the best technique to improve the next marketing campaign. The bank's clients were contacted in order to assess whether they subscribed to a bank term deposit or not. This is a classification problem in which we tried different techniques to predict subscription.
The best performing model was the Voting Ensemble selected by AutoML, with 0.91820941 accuracy, compared to the best HyperDrive model with 0.9072837632776934 accuracy, a maximum of 25 iterations, and a regularization strength of 0.3191910641048322.

Scikit-learn Pipeline

What is the pipeline architecture, including the data, hyperparameter tuning, and classification algorithm?
The first step is to download the data from the link provided in the boilerplate code; the next step is to clean it and transform the categorical columns into dummy variables (one-hot encoding) so that the algorithm can consume them. The classification algorithm is a Scikit-learn logistic regression model, and HyperDrive tunes its regularization strength and maximum number of iterations.
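Below is a minimal sketch of that data step, assuming the Azure ML TabularDatasetFactory loader, a placeholder URL in place of the link from the boilerplate code, and "y" as the response column:

```python
# Minimal sketch of the data step; the URL is a placeholder for the link in the
# boilerplate code and the column name "y" is an assumption.
import pandas as pd
from azureml.data.dataset_factory import TabularDatasetFactory

DATA_URL = "https://<storage-account>/bankmarketing_train.csv"  # placeholder

# Load the CSV as a TabularDataset and convert it to a pandas DataFrame.
ds = TabularDatasetFactory.from_delimited_files(path=DATA_URL)
df = ds.to_pandas_dataframe()

# One-hot encode the categorical columns so the model sees only numeric
# features, and map the response ("yes"/"no") to 1/0.
x = pd.get_dummies(df.drop(columns=["y"]))
y = df["y"].apply(lambda v: 1 if v == "yes" else 0)
```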
What are the benefits of the parameter sampler you chose?
I chose the RandomParameterSampling sampler with the regularization strength sampled from 0.1 to 0.9 and the maximum iterations chosen from (10, 25, 50, 100). The main benefit is that parameter combinations are drawn at random, which can reach the desired result much earlier than evaluating every value sequentially, as a grid search would.
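A sketch of that sampler, assuming a uniform distribution for the regularization strength and that the training script exposes --C and --max_iter arguments:

```python
# Random sampling over the two hyperparameters described above; the argument
# names --C and --max_iter are assumptions about train.py.
from azureml.train.hyperdrive import RandomParameterSampling, choice, uniform

param_sampling = RandomParameterSampling({
    "--C": uniform(0.1, 0.9),               # regularization strength
    "--max_iter": choice(10, 25, 50, 100),  # maximum number of iterations
})
```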
What are the benefits of the early stopping policy you chose?
The Bandit policy compares each run's reported primary metric against the currently best performing run at every evaluation_interval, and cancels any run that falls outside the slack allowed by slack_factor. In my case evaluation_interval=2 and slack_factor=1, so under-performing runs are checked every two intervals and terminated early instead of running to completion, which saves compute.
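A sketch of the policy together with a possible HyperDrive configuration; the ScriptRunConfig (src), the logged metric name "Accuracy", and the run counts are assumptions:

```python
# Bandit early-termination policy with the settings described above, plus an
# illustrative HyperDrive configuration.
from azureml.train.hyperdrive import BanditPolicy, HyperDriveConfig, PrimaryMetricGoal

early_termination_policy = BanditPolicy(evaluation_interval=2, slack_factor=1)

hyperdrive_config = HyperDriveConfig(
    run_config=src,                               # ScriptRunConfig wrapping train.py (assumed)
    hyperparameter_sampling=param_sampling,
    policy=early_termination_policy,
    primary_metric_name="Accuracy",               # metric name logged by train.py (assumed)
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=20,                            # illustrative
    max_concurrent_runs=4,                        # illustrative
)
```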

AutoML

What model and hyperparameters did AutoML generate?
AutoML produced its most accurate model with the Voting Ensemble algorithm in the 61st iteration, using 15 ensembled iterations.
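A sketch of an AutoML configuration consistent with this run; the timeout, number of cross-validations, label column name, and compute target are assumptions:

```python
# Illustrative AutoML configuration for the classification task; the timeout,
# n_cross_validations, label column, and compute target are assumptions.
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(
    task="classification",
    primary_metric="accuracy",
    training_data=ds,               # the TabularDataset loaded earlier
    label_column_name="y",          # assumed response column
    n_cross_validations=5,          # illustrative
    experiment_timeout_minutes=30,  # illustrative
    compute_target=compute_target,  # the cluster used for the experiment (assumed)
)
```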

Pipeline comparison

Compare the two models and their performance. What are the differences in accuracy? In architecture? If there was a difference, why do you think there was one?
The Scikit-learn logistic regression model reached a maximum accuracy of 0.9072837632776934 with a maximum of 25 iterations and a regularization strength of 0.3191910641048322, while the Voting Ensemble performed better with 0.91820941 accuracy in the 61st iteration with 15 ensembled iterations. The difference is mainly architectural: the ensemble combines the predictions of several models, whereas the HyperDrive run tunes a single logistic regression.

Future work

What are some areas of improvement for future experiments? Why might these improvements help the model?
We need to address the class imbalance problem, since the response variable is highly imbalanced. Re-balancing the classes (for example through re-weighting or resampling) should help because, with an imbalanced target, a model can score high accuracy while still misclassifying most of the minority class.
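One way to act on this in a future experiment, sketched with scikit-learn class weighting; x, y, and the hyperparameter values are carried over from the earlier sketches:

```python
# Re-weight the classes so the model is penalised more for misclassifying the
# rare positive class; x, y, C, and max_iter follow the earlier sketches.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, stratify=y, random_state=42)

model = LogisticRegression(C=0.3191910641048322, max_iter=25, class_weight="balanced")
model.fit(x_train, y_train)
print(model.score(x_test, y_test))
```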

Proof of cluster clean up

(Screenshot showing the compute cluster being deleted.)
