For this lab, we will be using the dataset from the Customer Analysis Business Case of the previous lab. This dataset can be found in the `files_for_lab` folder. In this lab we will explore categorical data.

Because we keep working on the same dataset as in the previous lab, please make a copy of the final Jupyter notebook of the previous lab in the current lab folder. Next, use Markdown to add a new section in the Jupyter notebook named `Lab Cleaning Categorical Data`. Then restart the kernel and run all the previous cells. Finally, keep working on the same notebook according to the following instructions.
- Define a function that, given a pandas DataFrame as input, creates a seaborn countplot of each categorical column. Make sure to sort the bars by frequency, i.e., the most frequent values should be placed first. Hint: use `.value_counts()`. In addition, if the number of unique values of a categorical column (its cardinality) is six or more, the corresponding countplot should have the bars placed on the y-axis instead of the x-axis.
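One possible shape for this function is sketched below. The names `countplot_layout`, `plot_categorical_counts`, `threshold`, and the toy dataframe are hypothetical and not part of the lab; the sketch also assumes that every non-numeric, non-datetime column counts as categorical.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def countplot_layout(series, threshold=6):
    """Return the bar order (most frequent first) and the axis to place the bars on."""
    # value_counts() sorts by frequency in descending order by default
    order = series.value_counts().index.tolist()
    # six or more unique values: put the categories on the y-axis
    axis = "y" if series.nunique() >= threshold else "x"
    return order, axis


def plot_categorical_counts(df, threshold=6):
    """Draw one frequency-sorted countplot per categorical column of df."""
    # Assumption: anything that is neither numeric nor datetime is categorical
    categorical = df.select_dtypes(exclude=["number", "datetime"])
    axes = {}
    for col in categorical.columns:
        order, axis = countplot_layout(categorical[col], threshold)
        fig, ax = plt.subplots(figsize=(8, 4))
        sns.countplot(data=categorical, order=order, ax=ax, **{axis: col})
        ax.set_title(col)
        axes[col] = ax
    return axes


# Toy data for illustration only; the real dataframe comes from the previous lab
toy_df = pd.DataFrame({
    "state": ["A", "A", "A", "B", "B", "C"],
    "code": ["u", "v", "w", "x", "y", "z"],
    "income": [10, 20, 30, 40, 50, 60],
})
toy_axes = plot_categorical_counts(toy_df)
```

Keeping the ordering and orientation logic in a small helper makes it easy to check without inspecting the figures.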
- The `policy_type` and `policy` columns are redundant, and what's worse, the `policy` column has many unique values (high cardinality), which will be problematic when it is dummified with a OneHotEncoder because the number of columns in the dataframe will increase considerably. Drop the column `policy_type` and transform the column `policy` to three possible values: L1, L2, and L3, using a function.
- Time dependency analysis. Use a seaborn line plot using the column `effective_to_date` to see if `total_claim_amount` is bigger at some specific dates. Use a figsize=(10,10).
- To continue the analysis, define an empty pandas DataFrame and add the following new columns:
  - `day` with the day number of `effective_to_date`
  - `day_name` with the day NAME of `effective_to_date`
  - `week` with the week of `effective_to_date`
  - `month` with the month NAME of `effective_to_date`
  - `total_claim_amount` with `total_claim_amount`
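The new columns listed above can be derived with the pandas `.dt` accessor. The following is a minimal sketch, assuming `effective_to_date` has already been converted to a datetime column; `customer_df` and its three rows are toy stand-ins, and `time_df` is the name used later in this lab.

```python
import pandas as pd

# Toy stand-in for the lab dataframe (hypothetical values)
customer_df = pd.DataFrame({
    "effective_to_date": pd.to_datetime(["2011-01-03", "2011-01-08", "2011-02-14"]),
    "total_claim_amount": [384.81, 1131.46, 566.47],
})

dates = customer_df["effective_to_date"]

time_df = pd.DataFrame()  # start empty, then add one column at a time
time_df["day"] = dates.dt.day                    # day number within the month
time_df["day_name"] = dates.dt.day_name()        # e.g. "Monday"
time_df["week"] = dates.dt.isocalendar().week    # ISO week number
time_df["month"] = dates.dt.month_name()         # e.g. "January"
time_df["total_claim_amount"] = customer_df["total_claim_amount"]
```

Here the week is taken as the ISO week number, which is one reasonable reading of "the week of `effective_to_date`".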
- Compute the total `target` column aggregated by `day_name`, rounded to two decimals, and then reorder the index of the resulting pandas Series using `.reindex(index=list_of_correct_days)`.
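As an illustration of this aggregation, here is a small sketch on toy data. It assumes the `target` is `total_claim_amount` and that `list_of_correct_days` is a list you define yourself with the weekdays in calendar order.

```python
import pandas as pd

# Toy data for illustration only
time_df = pd.DataFrame({
    "day_name": ["Monday", "Saturday", "Monday", "Wednesday"],
    "total_claim_amount": [100.123, 50.5, 20.2, 10.457],
})

# Assumed ordering: Monday first
list_of_correct_days = [
    "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
]

total_by_day = (
    time_df.groupby("day_name")["total_claim_amount"]
    .sum()                                   # total target per day name
    .round(2)                                # two decimals
    .reindex(index=list_of_correct_days)     # calendar order instead of alphabetical
)
```

Days that do not appear in the data end up as NaN after `.reindex()`; with the full dataset every weekday is expected to be present.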
- Use a seaborn line plot to plot the previous series. Do you see some differences by day of the week?
- Get the total number of claims by day of the week name and then reorder the index of the resulting pandas Series using `.reindex(index=list_of_correct_values)`.
- Get the median "target" by day of the week name and then sort the resulting values in descending order using .sort_values()
- Plot the median "target" by day of the week name using a seaborn barplot
- What can you conclude from this analysis?
- Compute the total `target` column aggregated by `month`, rounded to two decimals, and then reorder the index of the resulting pandas Series using `.reindex(index=list_of_correct_values)`.
- Can you do a monthly analysis given the output of the previous series? Why?
- Define a function to remove the outliers of a continuous numerical column, dropping the values that lie more than a given number of standard deviations away from the mean (thr=3).
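A minimal sketch of such a function follows. The name `remove_outliers` and its signature are assumptions; only the `thr=3` default comes from the instructions.

```python
import pandas as pd


def remove_outliers(df, column, thr=3):
    """Keep only the rows whose value in `column` is within thr standard deviations of the mean."""
    mean = df[column].mean()
    std = df[column].std()
    within_range = (df[column] - mean).abs() <= thr * std
    return df[within_range]


# Toy data: twenty ordinary values and one extreme value
toy_df = pd.DataFrame({"income": [10] * 10 + [12] * 10 + [1000]})
cleaned = remove_outliers(toy_df, "income")
```

Note that the returned dataframe keeps the original index labels, so the dropped rows leave gaps in the index.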
- Use the previous function to remove the outliers of the continuous data and to generate `continuous_cleaned_df`.
- Concatenate `continuous_cleaned_df`, `discrete_df`, `categorical_df`, and the relevant column of `time_df`. After removing outliers, `continuous_cleaned_df` will have fewer rows than the other dataframes, so when you concatenate them using `pd.concat()` the resulting dataframe will have NaNs because of the different sizes of each dataframe. Use `.dropna()` and `.reset_index()` to fix the final dataframe.
- Reorder the columns of the dataframe to place 'total_claim_amount' as the last column.
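The concatenation and clean-up steps can be sketched as follows. All four dataframes here are toy stand-ins, `month` is only a hypothetical choice for the relevant column of `time_df`, and `drop=True` is an assumption made to avoid keeping the old index as an extra column.

```python
import pandas as pd

# Toy stand-ins; row 1 was removed from the continuous data as an outlier
continuous_cleaned_df = pd.DataFrame(
    {"income": [100.0, 300.0], "total_claim_amount": [10.0, 30.0]}, index=[0, 2]
)
discrete_df = pd.DataFrame({"number_of_policies": [1, 2, 3]})
categorical_df = pd.DataFrame({"coverage": ["Basic", "Premium", "Basic"]})
time_df = pd.DataFrame({"month": ["January", "January", "February"]})

# axis=1 aligns rows on the index, so the missing row shows up as NaN
full_df = pd.concat(
    [continuous_cleaned_df, discrete_df, categorical_df, time_df[["month"]]], axis=1
)
full_df = full_df.dropna().reset_index(drop=True)

# Move total_claim_amount to the end
other_columns = [col for col in full_df.columns if col != "total_claim_amount"]
full_df = full_df[other_columns + ["total_claim_amount"]]
```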
- Turn the `response` column values into (Yes=1/No=0).
- Reduce the class imbalance in `education` by grouping together ["Master","Doctor"] into "Graduate" while keeping the other possible values as they are. In this way, you will slightly reduce the class imbalance at the price of losing a level of detail.
- Reduce the class imbalance of the `employmentstatus` column by grouping together ["Medical Leave", "Disabled", "Retired"] into "Inactive" while keeping the other possible values as they are. In this way, you will slightly reduce the class imbalance at the price of losing a level of detail.
- Deal with the column `Gender`, turning the values into (1/0).
- Now, deal with `vehicle_class`, grouping together "Sports Car", "Luxury SUV", and "Luxury Car" into a common group called `Luxury`, leaving the other values as they are. In this way, you will slightly reduce the class imbalance at the price of losing a level of detail.
- Now it's time to deal with the categorical ordinal columns, assigning a numerical value to each unique value respecting the implicit ordering. Encode the coverage: "Premium" > "Extended" > "Basic".
- Encode the column `employmentstatus` as: "Employed" > "Inactive" > "Unemployed".
- Encode the column `location_code` as: "Urban" > "Suburban" > "Rural".
- Encode the column `vehicle_size` as: "Large" > "Medsize" > "Small".
- Get a dataframe with the categorical nominal columns.
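One way to express the ordinal encodings is a dictionary of mappings, one per column. The instructions only fix the ordering, so the concrete numbers (0, 1, 2) are an assumption, as are the toy rows and the `state` column used to show the nominal selection.

```python
import pandas as pd

# Toy data for illustration only
df = pd.DataFrame({
    "coverage": ["Premium", "Extended", "Basic"],
    "employmentstatus": ["Employed", "Inactive", "Unemployed"],
    "location_code": ["Rural", "Urban", "Suburban"],
    "vehicle_size": ["Medsize", "Small", "Large"],
    "state": ["Oregon", "Arizona", "Oregon"],
})

# Assumed numeric codes; a larger number means a "greater" category
ordinal_maps = {
    "coverage": {"Basic": 0, "Extended": 1, "Premium": 2},
    "employmentstatus": {"Unemployed": 0, "Inactive": 1, "Employed": 2},
    "location_code": {"Rural": 0, "Suburban": 1, "Urban": 2},
    "vehicle_size": {"Small": 0, "Medsize": 1, "Large": 2},
}

for column, mapping in ordinal_maps.items():
    df[column] = df[column].map(mapping)

# After encoding, the columns that are still non-numeric are the nominal ones
categorical_nominal_df = df.select_dtypes(exclude="number")
```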
- Create a list named `levels` which has as many elements as there are categorical nominal columns. Each element must be another list with all the possible unique values of the corresponding categorical nominal column, i.e.:
  `levels = [ [col1_value1, col1_value2,...], [col2_value1, col2_value2,...], ...]`
- Instantiate an sklearn OneHotEncoder with drop set to `first` and categories to `levels`.
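The last two steps can be sketched as below on a toy nominal dataframe. Sorting the values inside each level and fitting the encoder are additions made here for illustration; the instructions only ask to instantiate it.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the categorical nominal dataframe
categorical_nominal_df = pd.DataFrame({
    "state": ["California", "Oregon", "Arizona", "Oregon"],
    "vehicle_class": ["Luxury", "SUV", "Two-Door Car", "SUV"],
})

# One inner list per column, holding that column's unique values
levels = [
    sorted(categorical_nominal_df[col].unique())
    for col in categorical_nominal_df.columns
]

# drop="first" removes the first level of each column to avoid redundant dummies
encoder = OneHotEncoder(drop="first", categories=levels)

# Fitting is shown only to illustrate the resulting shape
encoded = encoder.fit_transform(categorical_nominal_df).toarray()
```

With three levels per column and the first one dropped, each column contributes two dummy columns, so the toy output has four columns in total.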