An example of logistic regression being used to detect fraud
The Kaggle dataset can be found here.
A few challenges with the dataset:
- The dataset uses PCA to hide sensitive information, which makes it difficult to tell which features correlate most strongly with fraud.
- There are a lot of missing values. Over half of the hundreds of thousands of rows have missing data.
- There is a large class imbalance. Around 4 percent of the data is fraud.

Preprocessing steps:
- One-hot encoding the categorical variables.
- Dropping columns that have over 17% missing values.
- Plotting the distribution of null values to get an intuition for which columns may be important.
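
The column filtering and one-hot encoding described above can be sketched roughly as follows. This is a minimal illustration with pandas, not the project's actual code: the toy DataFrame, its column names (`amount`, `card_type`, `mostly_missing`, `is_fraud`) and the `NULL_THRESHOLD` constant are assumptions made up for the example. Only the 17% cutoff comes from the notes above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Kaggle dataset (column names are invented).
df = pd.DataFrame({
    "amount": [10.0, 25.5, np.nan, 7.2, 99.9, 12.0, 3.4, 58.0],
    "card_type": ["visa", "mc", "visa", None, "amex", "visa", "mc", "visa"],
    "mostly_missing": [np.nan, np.nan, 1.0, np.nan, np.nan, np.nan, np.nan, 2.0],
    "is_fraud": [0, 0, 0, 1, 0, 0, 0, 0],
})

# Cutoff from the notes above: drop columns with more than 17% missing values.
NULL_THRESHOLD = 0.17

# Fraction of missing values in each column, largest first.
null_fraction = df.isna().mean().sort_values(ascending=False)
print(null_fraction)
# With matplotlib installed, null_fraction.plot.bar() would show the same
# distribution as a bar chart.

# Keep only the columns at or below the threshold, preserving the original order.
keep_cols = [col for col in df.columns if null_fraction[col] <= NULL_THRESHOLD]
filtered = df[keep_cols]

# One-hot encode the categorical column. dummy_na=True adds an indicator column
# for missing categories (an assumption; the notes do not say how these were handled).
encoded = pd.get_dummies(filtered, columns=["card_type"], dummy_na=True)
print(encoded.columns.tolist())
```

In this toy frame `mostly_missing` is 75% null and is dropped, while `amount` and `card_type` are each 12.5% null and are kept. On the real data, the categorical columns would need to be identified first, for example by inspecting `df.dtypes`.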