Udacity-project

OverView

This project is part of the Udacity Azure ML Nanodegree. In this project, we build and optimize an Azure ML pipeline using the Python SDK and a provided Scikit-learn model. This model is then compared to an Azure AutoML run

Summary

Problem Statement Analysing the Bankmarketing dataset and predicting a person will do term deposit or not. Here "Y" defines whether will invest and we need the analyse the parameters that could affect the deposit in the next bank marketing model.

The best performing model was voting ensemble model with an accuracy of 0.9171 which was performed by Auto-ML run and is slightly better than the Logistic-Regression

Scikit-learn Pipeline

Architecture First a compute instance is created to run on virtual machine . Once the compute instance is created, we assign the compute instance to run our Jupyter Notebook and access our workspace, create an experiment and run the models

Steps Involved

First the workspace.config() is defined and experiment is created

We have to check for the compute instance and if it is not created then the compute instance is created

steps in train.py

The data is first loaded from the url

The Data is fed into the clean data to remove null values

The data is split in training and testing data

The training data is fed into Logistic Regression

HyperParameters: C: Inverse of regularisation

                  I have chosen discrete parameters choice(100, 10, 1.0, 0.1, 0.01)
                  
                  *max_iter:* number of iterations possible
                  
                  here I have chosen discrete paramters choice(100, 110, 120)

RandomParameterSampling In this method the hyperparameters could be discrete and continous,both are accepted

Early termination policy I have used Bandit policy. Bandit Policy is based on slack factor. Bandit terminates runs where the primary metric is not within the slack factor compared to the best performing run

slack factor: The slack allowed with respect to the best performing run

AutoML pipeline

My AutoML ran for 27 iterations in 30 mintutes. The best model is votingEnsemble . The primary metric I used is Acuraccy and Number of cross_validations as 6 for my AutoML configuation and I got an accuracy score of 0.9171

And the autoML model is min_samples_split=0.2442105263157895, min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False

Comparision of the both pipelies

In HyperDrive, I was able to tune the hyperparameters of logistic regression with different parameters whereas in the AutoML, I was able to apply different algorithm model on my dataset , ensemble model has has low bias and low variance hence the accuracy of the ensemble model is higher than the logistic regression model

Future Work

In future experiments, I want to try other models with with classification and see how they work. And I would also try median stopping policy as early termination policy and see how the accuracy of the model changes since this policy computes running averages across all training runs and terminates runs with primary metric values worse than the median of averages.And I would see also reduce the number of cross folds and see how it affects accuracy as the number of folds increases , the time training for the model also increases and hence cost of training also increases.

123manju900 / Udacity-project