For this lab, we will be using the dataset in the Customer Analysis Business Case. This dataset can be found in the `files_for_lab` folder.
An auto insurance company has collected data about its customers, including their demographics, education, employment, policy details, information about the insured vehicles, and claim amounts. You will help senior management answer some business questions so that they can better understand their customers, improve their services, and improve profitability.
Some business objectives for the case study could be:
- Retain customers,
- Analyze relevant customer data,
- Develop focused customer retention programs.
Based on the analysis, take targeted actions to increase profitable customer response, retention, and growth.
- Import the necessary libraries.
- Load the `we_fn_use_c_marketing_customer_value_analysis.csv` file into the variable `customer_df` (i.e. `customer_df = pd.read_csv("")`).
- First, look at its main features (`head`, `shape`, `info`).
- Rename the columns so they follow PEP 8 (snake case).
- Fix the data types of any column or columns as you see necessary. Note that some features you might want to treat as categorical are read as numerical by Python (and vice versa). For example, a column with year values like 2020, 2021, 2022, etc. might be read as numerical by Python, but you would want to use it as categorical data. Hint: One thing you can try is to convert the date column to datetime format.
- Plot a correlation matrix, and comment on what you observe.
- Plot every continuous variable. Comment on what you can see in the plots.
- Do the same with the categorical variables (be careful, you may need to change the plot type to one better suited for categorical data!). Comment on what you can see in the plots.
  You should also delete the column `customer_id` before you try to use a for loop on all the categorical columns. Discuss why deleting the column `customer_id` is required. Hint: Use bar plots to plot categorical data, with each unique category in the column on the x-axis and an appropriate measure on the y-axis.
- Look for outliers in the continuous variables. (Hint: There’s a good plot to do that!) If you find outliers, comment on what you will do with them.
- Check all columns for NaN values. Decide what (if anything) you will need to do with them.
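The loading, renaming, type-fixing and NaN-check steps above can be sketched roughly as follows. This is a minimal sketch on a tiny stand-in frame rather than the real CSV: the column names, the sample values, the `%m/%d/%y` date format and the helper `to_snake_case` are all assumptions made for illustration.

```python
import pandas as pd


def to_snake_case(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose column names are lower-case with underscores."""
    renamed = df.copy()
    renamed.columns = (
        renamed.columns.str.strip()
        .str.lower()
        .str.replace(r"\s+", "_", regex=True)
    )
    return renamed


# Hypothetical stand-in; in the lab this frame would come from pd.read_csv(...).
sample = pd.DataFrame({
    "Customer": ["AA1", "BB2", "CC3"],
    "Effective To Date": ["2/24/11", "1/31/11", "2/19/11"],
    "Total Claim Amount": [384.8, None, 566.5],
})

clean = to_snake_case(sample)

# Assumed date format (month/day/two-digit year); adjust to the real data.
clean["effective_to_date"] = pd.to_datetime(
    clean["effective_to_date"], format="%m/%d/%y"
)

# Number of missing values per column.
nan_counts = clean.isna().sum()
print(clean.dtypes)
print(nan_counts)
```

Passing an explicit `format` makes a mismatch between the assumed and the actual date layout fail loudly instead of being parsed silently in the wrong order.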
For this lab, we will be using the dataset in the Customer Analysis Business Case. This dataset can be found in the `files_for_lab` folder. In this lab we will explore categorical data. You can also continue working in the same Jupyter notebook from the previous lab, although that is not necessary.
- Import the necessary libraries if you are starting a new notebook.
- Load the CSV into the variable `customer_df`, as in `customer_df = pd.read_csv()`.
- What should we do with the `customer_id` column?
- Load the continuous and discrete variables into the `numerical_df` and `categorical_df` variables, e.g. `numerical_df = customer_df.select_dtypes()` and `categorical_df = customer_df.select_dtypes()`.
- Plot every categorical variable. What can you see in the plots? Note that in the previous lab you used a bar plot for categorical data, with each unique category in the column on the x-axis and an appropriate measure on the y-axis. This time you will try a different plot: for each categorical variable, put each unique category in the column on the x-axis and the target (which is numerical) on the y-axis.
- For the categorical data, check whether any data cleaning needs to be performed.
  Hint: You can use the function `value_counts()` on each of the categorical columns to check how well the different categories are represented in each column. Discuss whether this information could be used for data cleaning.
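As a rough illustration of the split into numerical and categorical frames and of the `value_counts()` hint, the sketch below works on a small invented frame. The column names, the values, and the choice of `total_claim_amount` as the target are assumptions, not something the lab text fixes.

```python
import pandas as pd

# Hypothetical stand-in for customer_df; the values are invented.
customer_df = pd.DataFrame({
    "state": ["California", "Oregon", "California", "Nevada", "California"],
    "coverage": ["Basic", "Basic", "Premium", "Extended", "Basic"],
    "income": [56274, 0, 48767, 36357, 62902],
    "total_claim_amount": [384.8, 1131.5, 566.5, 529.9, 138.1],
})

# Numeric columns on one side, everything else on the other.
numerical_df = customer_df.select_dtypes(include="number")
categorical_df = customer_df.select_dtypes(exclude="number")

# Share of rows per category; very small shares are candidates for
# grouping into an "other" bucket during cleaning.
category_shares = {
    col: categorical_df[col].value_counts(normalize=True)
    for col in categorical_df.columns
}

# Numeric counterpart of a "category on x, target on y" plot:
# the median of the assumed target within each category.
target_by_coverage = customer_df.groupby("coverage")["total_claim_amount"].median()
print(category_shares["state"])
print(target_by_coverage)
```

Using `exclude="number"` for the categorical side also picks up date columns, so on the real data you may want to convert or drop those first.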
For this lab, we will be using the same dataset we used in the previous labs. We recommend using the same notebook, since you will be reusing the variables you previously created and used in those labs.
- Open the `categoricals` variable we created before.
categoricals = data.select_dtypes(include="object")
categoricals.head()
- Plot all the categorical variables with the proper plot. What can you see?
- There might be some columns that seem to be redundant; check their values to be sure. What should we do with them?
- Plot the time variable. Can you extract any insight from it?
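One possible way to check for redundancy and to prepare the time variable for plotting is sketched below. The frame is invented, and the column names `policy_type`, `policy`, `effective_to_date` and `total_claim_amount` are assumptions about how the renamed dataset looks.

```python
import pandas as pd

# Hypothetical stand-in for the lab dataframe; the values are invented.
data = pd.DataFrame({
    "policy_type": ["Corporate Auto", "Personal Auto", "Personal Auto", "Corporate Auto"],
    "policy": ["Corporate L3", "Personal L3", "Personal L1", "Corporate L2"],
    "effective_to_date": pd.to_datetime(
        ["2011-02-24", "2011-01-31", "2011-02-24", "2011-01-20"]
    ),
    "total_claim_amount": [384.8, 1131.5, 566.5, 529.9],
})

# If every value of `policy` maps to exactly one `policy_type`, the coarser
# column carries no information that the finer one does not already have.
types_per_policy = data.groupby("policy")["policy_type"].nunique()
policy_type_is_redundant = bool((types_per_policy == 1).all())

# Aggregate the target per date; the resulting Series can be drawn
# with claims_per_day.plot().
claims_per_day = (
    data.groupby("effective_to_date")["total_claim_amount"].sum().sort_index()
)
print(policy_type_is_redundant)
print(claims_per_day)
```

`pd.crosstab(data["policy"], data["policy_type"])` gives the same information in table form if you prefer to inspect it by eye.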
For this lab, we will be using the same dataset we used in the previous labs. We recommend using the same notebook, since you will be reusing the variables you previously created and used in those labs.
So far we have worked on EDA. This lab will focus on data cleaning and wrangling, based on everything we noticed before.
- We will start by removing outliers. So far, we have discussed different methods to remove outliers. Pick the one you feel most comfortable with, define a function for it, and apply that function to the dataframe.
- Create a copy of the dataframe for the data wrangling.
- Normalize the continuous variables. You can use any method you want.
- Encode the categorical variables.
- The time variable can be useful. Try to transform its data into a more useful form. Hint: Day of the week and month as integers might be useful.
- Since the model will only accept numerical data, check that every column is numerical; if some are not, convert them using encoding.
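The outlier, normalization and time-variable steps above could look roughly like this. It is a sketch under stated assumptions: the IQR rule and min-max scaling are just one choice among the methods the lab allows, and the helpers `remove_outliers_iqr` and `min_max_scale`, the column names and the sample values are all hypothetical.

```python
import pandas as pd


def remove_outliers_iqr(df, columns, k=1.5):
    """Keep rows whose values lie within k * IQR of the quartiles in every column."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        keep &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]


def min_max_scale(series):
    """Rescale a numeric Series linearly to the range [0, 1]."""
    return (series - series.min()) / (series.max() - series.min())


# Hypothetical stand-in data; 500 is a deliberately extreme value.
raw = pd.DataFrame({
    "monthly_premium_auto": [10, 12, 11, 13, 12, 500],
    "effective_to_date": pd.to_datetime([
        "2011-02-24", "2011-01-31", "2011-02-19",
        "2011-01-20", "2011-02-03", "2011-01-25",
    ]),
})

# Work on a copy so the original frame stays untouched.
wrangled = remove_outliers_iqr(raw, ["monthly_premium_auto"]).copy()
wrangled["monthly_premium_auto"] = min_max_scale(wrangled["monthly_premium_auto"])

# Replace the datetime column with integer features (Monday is 0).
wrangled["day_of_week"] = wrangled["effective_to_date"].dt.dayofweek
wrangled["month"] = wrangled["effective_to_date"].dt.month
wrangled = wrangled.drop(columns="effective_to_date")
print(wrangled)
```

Removing outliers before scaling matters here: with the extreme value still present, min-max scaling would squeeze all other rows close to 0.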
Hint for Categorical Variables
- You should deal with the categorical variables as shown below (for ordinal encoding, dummy code has been provided as well):
# One hot to state
# Ordinal to coverage
# Ordinal to employmentstatus
# Ordinal to location code
# One hot to marital status
# One hot to policy type
# One hot to policy
# One hot to renew offer
# One hot to sales channel
# One hot vehicle class
# Ordinal vehicle size
data["coverage"] = data["coverage"].map({"Basic" : 0, "Extended" : 1, "Premium" : 2})
# given that column "coverage" in the dataframe "data" has three categories:
# "Basic", "Extended", and "Premium", and the values are to be represented in that order.
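A small sketch of the two encodings named in the hint, run on an invented frame. The `state` values are assumptions, and `drop_first=True` is an optional design choice rather than something the lab requires.

```python
import pandas as pd

# Hypothetical stand-in for the lab dataframe.
data = pd.DataFrame({
    "coverage": ["Basic", "Premium", "Extended"],
    "state": ["California", "Oregon", "California"],
})

# Ordinal encoding: the mapping preserves the order Basic < Extended < Premium.
data["coverage"] = data["coverage"].map({"Basic": 0, "Extended": 1, "Premium": 2})

# map() returns NaN for any category missing from the dictionary,
# so this check catches typos in the mapping keys.
assert data["coverage"].notna().all()

# One-hot encoding: one 0/1 column per category. drop_first removes the
# first category, which the remaining columns already imply.
encoded = pd.get_dummies(data, columns=["state"], drop_first=True, dtype=int)
print(encoded)
```

Ordinal encoding fits columns whose categories have a natural order; for unordered columns such as `state`, one-hot encoding avoids implying a ranking that does not exist.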
For this lab, we will be using the same dataset we used in the previous labs. We recommend using the same notebook, since you will be reusing the variables you previously created and used in those labs.
- In this final lab, we will model our data. Import `train_test_split` from sklearn and separate the data.
- Try a simple linear regression with all the data to see whether we are getting good results.
- Great! Now define a function that takes a list of models and trains (and tests) them so we can try a lot of them without repeating code.
- Use the function to check `LinearRegression` and `KNeighborsRegressor`.
- You can also check the `MLPRegressor` for this task!
- Check and discuss the results.
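The modelling steps above might be sketched as follows. The function name `evaluate_models` is hypothetical, the data is synthetic (generated with NumPy rather than taken from the lab dataset), and R² on the test split is just one reasonable score to report.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor


def evaluate_models(models, X_train, X_test, y_train, y_test):
    """Fit each model on the training split and return its R^2 on the test split."""
    scores = {}
    for model in models:
        model.fit(X_train, y_train)
        scores[type(model).__name__] = model.score(X_test, y_test)
    return scores


# Synthetic stand-in for the wrangled features and the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scores = evaluate_models(
    [LinearRegression(), KNeighborsRegressor(n_neighbors=5)],
    X_train, X_test, y_train, y_test,
)
for name, r2 in scores.items():
    print(f"{name}: R^2 = {r2:.3f}")
```

Keying the result by class name keeps the output readable, but two models of the same class would overwrite each other; pass `(name, model)` pairs instead if you want to compare several configurations of one estimator.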
Refer to the `files_for_lab/we_fn_use_c_marketing_customer_value_analysis.csv` dataset.
- Get the numerical variables from our dataset.
- Check, using a distribution plot, whether the variables fit the theoretical normal or exponential distribution.
- Check whether any of the transformations (log-transform, etc.) we have seen up to this point changes the result.
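Alongside the distribution plots, skewness gives a quick numeric view of how far a variable is from a symmetric, normal-like shape and of whether a transformation helps. The sketch below uses synthetic log-normal data as a stand-in; the column name `customer_lifetime_value` and the generated values are assumptions, not the lab data.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for one numerical column.
rng = np.random.default_rng(0)
numerical = pd.DataFrame({
    "customer_lifetime_value": rng.lognormal(mean=8.0, sigma=1.0, size=1000),
})

# Skewness near 0 suggests a roughly symmetric distribution.
skew_before = numerical["customer_lifetime_value"].skew()

# log1p computes log(1 + x), so it also tolerates zeros in the data.
log_values = np.log1p(numerical["customer_lifetime_value"])
skew_after = log_values.skew()

print(f"skew before: {skew_before:.2f}, after log-transform: {skew_after:.2f}")
```

Calling `.hist(bins=30)` on the original column and on `log_values` gives the corresponding before-and-after distribution plots.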