The task is to use existing health and vehicle insurance customer data to predict whether new customers are open to purchasing vehicle insurance from this company.
We have data on existing health insurance customers. It includes 12 relevant data points, such as age, gender, sales channel and vehicle ownership, and, most importantly, the target variable: whether the customer has vehicle insurance or not. The data is available for 390K existing customers.
The task was divided into 2 main parts :
1. Statistical analysis of the dataset to discover relationships between each feature and the target variable, so that management can use this information to make better business decisions.
2. Creating a machine learning pipeline that can take in the data of any new customer and predict whether they will be interested in vehicle insurance. The pipeline had to be kept modular, so that it can be retrained often as new data is collected.
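As a rough illustration of the second part, the sketch below shows one way a modular, retrainable pipeline could be laid out with Scikit-Learn. It is not the project's actual pipeline: the helper names `build_pipeline` and `train`, the choice of Logistic Regression, the column names other than `Annual_Premium` and `Response`, and the synthetic data are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed column names; the real dataset's schema may differ.
NUMERIC = ["Age", "Annual_Premium"]
CATEGORICAL = ["Gender", "Vehicle_Damage"]


def build_pipeline() -> Pipeline:
    """Return a fresh, unfitted pipeline so each retraining starts clean."""
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), NUMERIC),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("model", LogisticRegression(max_iter=1000)),
    ])


def train(data: pd.DataFrame, target: str = "Response") -> Pipeline:
    """Fit a new pipeline on whatever labelled data is currently available."""
    return build_pipeline().fit(data[NUMERIC + CATEGORICAL], data[target])


# Small synthetic sample, for illustration only.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Age": rng.integers(20, 80, n),
    "Annual_Premium": rng.normal(30000, 8000, n),
    "Gender": rng.choice(["Male", "Female"], n),
    "Vehicle_Damage": rng.choice(["Yes", "No"], n),
})
toy["Response"] = ((toy["Vehicle_Damage"] == "Yes") & (rng.random(n) < 0.5)).astype(int)

pipeline = train(toy)  # call again whenever new data has been collected

new_customer = pd.DataFrame([{
    "Age": 45, "Annual_Premium": 32000.0, "Gender": "Male", "Vehicle_Damage": "Yes",
}])
probability = pipeline.predict_proba(new_customer)[0, 1]
print(f"Estimated probability of interest: {probability:.2f}")
```

Keeping preprocessing and the model inside one `Pipeline` object means retraining is a single `train(...)` call, and a new customer's raw record goes through exactly the same transformations at prediction time.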
We trained five classifiers (Logistic Regression, K-Nearest Neighbors, Random Forest, XGBoost and CatBoost) and used GridSearchCV and BayesSearchCV for hyperparameter tuning. Comparing both F1 and AUC-ROC scores, the Random Forest and XGBoost models perform the best: recall of 65%, best AUC-ROC = 0.86, best F1 = 0.44.
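To make the tuning step concrete, here is a minimal sketch of a grid search that tracks F1, AUC-ROC and recall at the same time. It covers only the GridSearchCV half of the tuning. The synthetic data, its class balance, the parameter grid and the use of `class_weight="balanced"` are assumptions for illustration, not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the real dataset (class balance chosen arbitrarily).
X, y = make_classification(n_samples=600, n_features=8, weights=[0.88], random_state=0)

# Illustrative grid only; not the grid searched in the project.
param_grid = {"max_depth": [3, 6], "n_estimators": [25, 50]}

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid,
    scoring={"f1": "f1", "roc_auc": "roc_auc", "recall": "recall"},
    refit="f1",  # the final model is refit using the best mean F1
    cv=3,
)
search.fit(X, y)

# Report every tracked metric for the configuration that won on F1.
best = search.best_index_
for metric in ("f1", "roc_auc", "recall"):
    score = search.cv_results_[f"mean_test_{metric}"][best]
    print(f"{metric}: {score:.3f}")
print("best parameters:", search.best_params_)
```

Scoring on several metrics at once makes it possible to compare models on both F1 and AUC-ROC from a single search, while `refit` decides which metric picks the final model.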
- Customers aged between 30 and 70 are more likely to buy insurance.
- Customers with a driving licence have a higher chance of buying insurance.
- Customers with vehicle damage are more likely to buy insurance.
- Customers whose vehicle is between 1 and 2 years old are more likely to be interested.
- Customers who were not previously insured are more likely to be interested.
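Findings like the ones above usually come from comparing the share of positive responses across the levels of each feature. The sketch below shows that calculation with Pandas. The rows are made up for the example and are not taken from the project's data, and `response_rate` is a hypothetical helper name.

```python
import pandas as pd

# Made-up rows for illustration; not taken from the project's data.
toy = pd.DataFrame({
    "Vehicle_Damage":     ["Yes", "Yes", "Yes", "No", "No", "No", "Yes", "No"],
    "Previously_Insured": [0, 0, 1, 1, 1, 0, 0, 1],
    "Response":           [1, 0, 0, 0, 0, 0, 1, 0],
})


def response_rate(data: pd.DataFrame, feature: str, target: str = "Response") -> pd.Series:
    """Share of positive responses within each level of `feature`."""
    return data.groupby(feature)[target].mean().sort_values(ascending=False)


for feature in ("Vehicle_Damage", "Previously_Insured"):
    print(response_rate(toy, feature), end="\n\n")
```

On real data the same helper can be applied to binned numeric features as well, for example age groups built with `pd.cut`.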
Data wrangling :
- NumPy
- Pandas
For Graphing :
- Matplotlib
- Seaborn
Machine learning :
- Scikit-Learn
- scikit-optimize (skopt)
- XGBoost
- CatBoost
Miscellaneous :
- Google Colab tools

- About this Project
- Problem Statement
- Business Goal
- Approach Taken in this Project
  - Understanding the given Data
- Initial Code : Initializing the Data and Modules
  - Installing and Importing Libraries
  - Import Dataset and Initial Data Checks
- Data Preparation and Cleaning
- Exploratory Data Analysis
  - Initial Macro-Level Data Analysis
  - Variable wise EDA
  - Target Variable (Response)
  - Age variable
  - Annual_Premium
  - Gender variable
  - Driving License
  - Previously Insured
  - Vehicle Age
  - Vehicle Damage
  - Vintage
  - Region Code
  - Policy Sales Channel
  - Correlation Plot for Numeric Features
- Data Preprocessing and Feature Engineering
  - Outlier Treatment in feature : Annual_Premium
  - Label Encoding
  - Target Mean Encoding
  - Cleaned Data Exporting
- Building Prediction Systems using ML Models
- Import cleaned final data
- Classifier Performance Reporting Function
  - Overfitting Underfitting Debugging Notes
  - Metrics to be used during HyperParameter Tuning
  - Random Forest Specific Custom Defined Metrics
  - Function Definitions for Analytics report generation
- Logistic Regression Classifier Algorithm
  - LR Classifier Generator Function
  - LR Hyper Parameter Tuning : GridSearch
  - Final Logistic Regression Training run
- K Nearest Neighbours Classifier Algorithm
  - Default Parameters : KNeighborsClassifier
  - KNN Model Generator Function
  - KNN Hyper Parameter Tuning : GridSearch
  - Final KNN Training run
- Random Forests of Decision Trees
  - Default Parameters : RandomForestClassifier
  - Base Estimator Generator Function
  - HyperParameter tuning using GridSearchCV
  - Final Training Run
- Gradient Boosted Trees using XGBoost Library
  - XGBoost algorithm training and tuning notes
  - XGBoost Estimator Instance Generator Function
  - HyperParameter tuning using BayesSearchCV
  - Final Training Run for XGBoost
  - Feature Importance
- Categorical Gradient Boosted Trees using CatBoost Library
  - Cleaning Raw Data with Categorical Encoding
  - CatBoost Estimator Instance Generator Function
  - Model Evaluation
  - Final Training Run for CatBoost
  - Feature Importance
- Inferences and Conclusions
- What Worked? What Did Not Work?
- Future Work and More Ideas to Explore
  - Custom Metric, based on Cost of approaching the customer, so as to create