For this lab, we will be using the dataset in the Customer Analysis Business Case. This dataset can be found in files_for_lab
folder. In this lab we will explore categorical data.
- Import the necessary libraries if you are starting a new notebook.
- Load the continuous and discrete variables into
continuous_df
anddiscrete_df
variables. - Plot a correlation matrix, what can you see?
- Create a function to plot every discrete variables. Do the same with continuous variables (be careful, you may change the plot type to another one better suited for continuous data).
- What can you see in the plots?
- Look for outliers in the continuous variables we have found. Hint: There was a good plot to do that.
- Have you found outliers? If you have, what should we do with them?
- Check nan values per column.
- Define a function that differentiate between continuous and discrete variables. Hint: Number of unique values might be useful. Store continuous data into a
continuous
variable and do the same fordiscrete
and categorical. - for the categorical data, check if there is some kind of text in a variable so we would need to clean it. Hint: Use the same method you used in step 7. Depending on the implementation, decide what to do with the variables you get.
- Get categorical features.
- What should we do with the customer id column?