HEALTH-INSURANCE-CROSS-SELL-PREDICTION

Project Classification : Insurance Cross-Selling

-- Project Status: [Completed]

Project Summary :

Problem Statement :

The task is to use existing Health and Vehicle Insurance customer data to predict whether new customers are open to purchasing Vehicle Insurance from this company.

About the Data :

We have data on existing Health Insurance customers. It includes 12 relevant data points such as age, gender, sales channel data and vehicle ownership data and, most importantly, the target variable : whether the customer has vehicle insurance or not. The data is available for 390K existing customers.

Approach taken :

The task was divided into 2 main parts :

1. Statistical analysis of the dataset to discover relationships between each feature and the target variable, so that management can use this information to make better business decisions.
2. Creating a Machine Learning pipeline that can take in the data of any new customer and predict whether they will be interested in vehicle insurance. The pipeline was required to be modular, so that it can be retrained often as new data is collected.
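To make the idea of a modular, retrainable pipeline concrete, here is a minimal sketch using scikit-learn. It is not the notebook's actual code: the column names, the choice of Logistic Regression, the helper names `build_pipeline` and `retrain`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed column names, for illustration only.
NUMERIC = ["Age", "Annual_Premium"]
CATEGORICAL = ["Gender", "Vehicle_Damage"]


def build_pipeline():
    """Return a fresh, unfitted pipeline (preprocessing + classifier)."""
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), NUMERIC),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("model", LogisticRegression(max_iter=1000, class_weight="balanced")),
    ])


def retrain(df, target="Response"):
    """Fit a new pipeline on whatever data is currently available."""
    pipe = build_pipeline()
    pipe.fit(df[NUMERIC + CATEGORICAL], df[target])
    return pipe


# Synthetic stand-in for the customer table (not the real dataset).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Age": rng.integers(20, 80, n),
    "Annual_Premium": rng.normal(30000, 8000, n),
    "Gender": rng.choice(["Male", "Female"], n),
    "Vehicle_Damage": rng.choice(["Yes", "No"], n),
})
toy["Response"] = ((toy["Vehicle_Damage"] == "Yes") & (rng.random(n) < 0.6)).astype(int)

model = retrain(toy)

# Score a single new customer with the fitted pipeline.
new_customer = pd.DataFrame([{
    "Age": 45, "Annual_Premium": 32000.0, "Gender": "Male", "Vehicle_Damage": "Yes",
}])
proba = model.predict_proba(new_customer)[0, 1]
print(f"Estimated probability of interest: {proba:.2f}")
```

Because preprocessing and the model live in one object, retraining on newly collected data is a single call to `retrain`, and the classifier step can be swapped without touching the rest.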

Technical Details for ML :

We trained 5 different algorithms ( Logistic Regression , K-Nearest Neighbors , Random Forest , XGBoost and CatBoost ) and used GridSearchCV and BayesSearchCV for hyperparameter tuning. Comparing both F1 and AUC-ROC scores, the Random Forest and XGBoost models perform the best : recall of 65%, best AUC-ROC = 0.86, best F1 = 0.44.
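As a rough illustration of tuning while tracking both F1 and AUC-ROC, the sketch below runs `GridSearchCV` with two scorers on a Random Forest. The synthetic imbalanced dataset, the parameter grid and the choice to refit on AUC-ROC are assumptions for this example and do not reproduce the notebook's settings or its reported scores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, recall_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data as a stand-in (class ratio is an assumption).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.88, 0.12], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Small illustrative grid; a real search would cover more values.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 6]}

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid,
    scoring={"f1": "f1", "roc_auc": "roc_auc"},  # track both metrics per candidate
    refit="roc_auc",                             # metric used to pick the final model
    cv=3,
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out split.
best = search.best_estimator_
pred = best.predict(X_test)
proba = best.predict_proba(X_test)[:, 1]
report = {
    "recall": recall_score(y_test, pred),
    "f1": f1_score(y_test, pred),
    "roc_auc": roc_auc_score(y_test, proba),
}
print(search.best_params_, report)
```

`BayesSearchCV` from scikit-optimize follows the same `fit` / `best_params_` pattern, but samples candidates from search spaces instead of enumerating a grid.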

Conclusions :

Insights from exploring the Data :

● Customers aged between 30 and 70 are more likely to buy insurance.
● Customers with a Driving Licence have a higher chance of buying insurance.
● Customers with Vehicle Damage are more likely to buy insurance.
● Customers with a Vehicle Age between 1 and 2 years are more likely to be interested.
● Customers who were not previously insured are more likely to be interested.
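Insights of this kind typically come from comparing the positive response rate across the groups of a feature. A minimal pandas sketch follows; the column names and the eight toy rows are made up for illustration, and `response_rate` is a hypothetical helper rather than a function from the notebook.

```python
import pandas as pd

# Toy rows, invented for illustration (not taken from the real dataset).
toy = pd.DataFrame({
    "Vehicle_Damage": ["Yes", "Yes", "No", "No", "Yes", "No", "Yes", "No"],
    "Previously_Insured": [0, 0, 1, 1, 0, 1, 1, 0],
    "Response": [1, 0, 0, 0, 1, 0, 0, 1],
})


def response_rate(df, feature, target="Response"):
    """Share of positive responses and customer count per group of `feature`."""
    return df.groupby(feature)[target].agg(rate="mean", customers="count")


damage_rates = response_rate(toy, "Vehicle_Damage")
insured_rates = response_rate(toy, "Previously_Insured")
print(damage_rates)
print(insured_rates)
```

Reporting the customer count next to each rate helps avoid over-reading a high rate that rests on only a handful of customers.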

Python Libraries used

Data wrangling :

Numpy
Pandas

For Graphing :

Matplotlib
Seaborn

Machine learning :

Scikit-Learn
Scikit-Optimize (skopt)
XGBoost
CatBoost

Miscellaneous :

Google colab tools

Getting Started

The Structure of the main IPYNB notebook :

About this Project
Problem Statement
Business Goal
Approach Taken in this Project
- Understanding the given Data
Initial Code : Initializing the Data and Modules
- Installing and Importing Libraries
- Import Dataset and Initial Data Checks
Data Preparation and Cleaning
Exploratory Data Analysis
- Initial Macro-Level Data Analysis
- Variable wise EDA
  - Target Variable (Response)
  - Age variable
  - Annual_Premium
  - Gender variable
  - Driving License
  - Previously Insured
  - Vehicle Age
  - Vehicle damage
  - Vintage
  - Region Code
  - Policy Sales Channel
- Correlation Plot for Numeric Features
Data Preprocessing and Feature Engineering
- Outlier Treatment in feature : Annual_Premium
- Label Encoding
- Target Mean Encoding
- Cleaned Data Exporting
Building Prediction Systems using ML Models
- Import cleaned final data
- Classifier Performance Reporting Function
  - Overfitting Underfitting Debugging Notes
  - Metrics to be used during HyperParameter
  - Random Forest Specific Custom Defined Metrics
  - Function Definitions for Analytics report generation
- Logistic Regression Classifier Algorithm
  - LR Classifier Generator Function
  - LR Hyper Parameter Tuning : GridSearch
  - Final Logistic Regression Training run
- K Nearest Neighbours Classifier Algorithm
  - Default Parameters : KNeighborsClassifier
  - KNN Model Generator Function
  - KNN Hyper Parameter Tuning : GridSearch
  - Final KNN Training run
- Random Forests of Decision Trees
  - Default Parameters : RandomForestClassifier
  - Base Estimator Generator Function
  - HyperParameter tuning using GridSearchCV
  - Final Training Run
- Gradient Boosted Trees using XGBoost Library
  - XGBoost algorithm training and tuning notes
  - XGBoost Estimator Instance Generator Function
  - HyperParameter tuning using BayesSearchCV
  - Final Training Run for XGBoost
  - Feature Importance
- Categorical Gradient Boosted Trees using CatBoost Library
  - Cleaning Raw Data with Categorical Encoding
  - CatBoost Estimator Instance Generator Function
  - Model Evaluation
  - Final Training Run for CatBoost
  - Feature Importance
Inferences and Conclusions
What Worked? What Did Not Work?
Future Work and More Ideas to Explore
- Custom Metric , based on Cost of approaching the customer , so as to create
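
Two preprocessing steps named in the outline above, outlier treatment on Annual_Premium and target mean encoding, can be sketched as follows. The README does not say which outlier rule or which encoded features the notebook uses, so the IQR-based capping, the `Region_Code` example column, and all three helper names are assumptions for illustration.

```python
import pandas as pd


def cap_outliers_iqr(series, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


def fit_target_mean(train, feature, target="Response"):
    """Learn the per-category target mean and the global mean from training data."""
    mapping = train.groupby(feature)[target].mean()
    return mapping, train[target].mean()


def apply_target_mean(df, feature, mapping, fallback):
    """Encode categories; unseen categories fall back to the global mean."""
    return df[feature].map(mapping).fillna(fallback)


# Toy values, invented for illustration.
premiums = pd.Series([20000, 25000, 30000, 35000, 500000], dtype=float)
capped = cap_outliers_iqr(premiums)

train = pd.DataFrame({
    "Region_Code": [28, 28, 8, 8, 8],
    "Response": [1, 0, 0, 0, 1],
})
mapping, global_mean = fit_target_mean(train, "Region_Code")

new_data = pd.DataFrame({"Region_Code": [28, 99]})  # 99 never appears in training
encoded = apply_target_mean(new_data, "Region_Code", mapping, global_mean)
print(capped.tolist(), encoded.tolist())
```

Fitting the mapping on training data only and applying it to new data avoids leaking the target into validation or test rows.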
