To run the Jupyter notebooks and Python scripts, you will need a standard installation of Anaconda with Python 3.6.x.
Additional libraries needed:
- sklearn
- imblearn
- keras
- tensorflow (tensorflow-gpu is preferred as the neural network training can take quite a lot of time)
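A quick way to confirm the environment is ready is to check that each library can be located before opening the notebooks. The snippet below is a minimal sketch, not part of the repository: the helper name `missing_libraries` is hypothetical, and the import names are assumed to match the list above (`tensorflow-gpu` is imported under the name `tensorflow`).

```python
import importlib.util

# Import names taken from the library list above.
# tensorflow-gpu is imported as "tensorflow", so it needs no separate entry.
REQUIRED = ["sklearn", "imblearn", "keras", "tensorflow"]


def missing_libraries(names=REQUIRED):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Prints an empty list when everything is installed.
print("Missing libraries:", missing_libraries())
```

Any name that is printed still needs to be installed into the Anaconda environment.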
This project was done as the capstone requirement for Udacity's Data Scientist Nanodegree. The goal was to characterize what types of individuals are more likely to be customers of a mail-order retailer and predict which customers would respond positively to marketing campaigns.
The data used for this project is not publicly available. It was provided only to those participating in the "in class" competition.
- features.csv - data dictionary
- Arvato Report.pdf - Analysis report
- segmentation/Arvato Project Workbook.ipynb - Data exploration and preprocessing
- segmentation/Customer Segmentation Report.ipynb - Analysis of customers
- segmentation/Mailout.ipynb - Analysis of mailout data using clustering model
- segmentation/clean_data.py - Python script for cleaning the segmentation data
- segmentation/fit_clustering.py - File containing the clustering pipeline function. This can also be used as a standalone script.
- supervised/Supervised Learning Using Ensemble Methods.ipynb - Classification using ensemble methods
- supervised/Supervised Learning Using Keras.ipynb - Classification using a neural network
- supervised/clean_data.py - Python script for cleaning classification data
- supervised/preprocess.py - Python file for preprocessing functions
- Clean population and customer data
- From the segmentation directory, run:
python clean_data.py [data_dir]/Udacity_AZDIAS_052018.csv ../features.csv
- From the segmentation directory, run:
python clean_data.py [data_dir]/Udacity_CUSTOMERS_052018.csv ../features.csv
- Run the Customer Segmentation Report notebook
- Clean the training and test data
- From the supervised directory, run:
python clean_data.py [data_dir]/Udacity_MAILOUT_052018_TRAIN.csv ../features.csv
- From the supervised directory, run:
python clean_data.py [data_dir]/Udacity_MAILOUT_052018_TEST.csv ../features.csv
- Run the Supervised Learning Using Ensemble Methods notebook
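The four cleaning runs above can also be scripted. The following is a rough sketch under stated assumptions: the helper names `cleaning_commands` and `run_all` are hypothetical and not part of the repository, the script is assumed to be started from the repository root, and the file names are copied from the steps above (adjust them if your copies are named differently).

```python
import subprocess
import sys
from pathlib import Path

# (directory to run in, input file) pairs, in the order listed in the steps above.
STEPS = [
    ("segmentation", "Udacity_AZDIAS_052018.csv"),
    ("segmentation", "Udacity_CUSTOMERS_052018.csv"),
    ("supervised", "Udacity_MAILOUT_052018_TRAIN.csv"),
    ("supervised", "Udacity_MAILOUT_052018_TEST.csv"),
]


def cleaning_commands(data_dir, python=sys.executable):
    """Return (working_directory, argv) pairs for the four clean_data.py runs."""
    # Make the data path absolute because each command runs in a different directory.
    data_dir = Path(data_dir).resolve()
    return [
        (workdir, [python, "clean_data.py", str(data_dir / filename), "../features.csv"])
        for workdir, filename in STEPS
    ]


def run_all(data_dir):
    """Run every cleaning step in order, stopping at the first failure."""
    for workdir, argv in cleaning_commands(data_dir):
        subprocess.run(argv, cwd=workdir, check=True)
```

Calling `run_all("path/to/data")` from the repository root would then execute the cleaning steps in sequence; the notebooks still need to be run separately afterwards.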
The detailed analysis of the results can be read in this Medium post or in Arvato Report.pdf.
- One group was found to be more likely to be customers: these individuals were older, more religious, and more inclined to save.
- Two groups were found to be less likely to be customers: 1) younger individuals with low purchasing activity and low wealth, and 2) individuals from areas with low population density who were less culturally minded and less religious.
The final model had a local ROC AUC score of 0.76294 and a Kaggle score of 0.80143 (https://www.kaggle.com/c/udacity-arvato-identify-customers/leaderboard).
Model | Local Score | Kaggle Score |
---|---|---|
Keras | 0.65836 | 0.65842 |
Gradient Boost | 0.76524 | 0.79327 |
AdaBoost | 0.76238 | 0.79791 |
LightGBM (final) | 0.76294 | 0.80143 |
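For readers unfamiliar with the metric, ROC AUC can be read as the probability that a randomly chosen responder receives a higher predicted score than a randomly chosen non-responder. The sketch below illustrates that definition in plain Python on made-up toy values; the function name `roc_auc` is hypothetical, and it is an assumption that the notebooks compute their scores with an equivalent routine such as `sklearn.metrics.roc_auc_score`.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative label")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy labels and scores, not project data: 3 of the 4 pairs are ranked correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking. The pairwise loop is quadratic in the number of samples, so it is only meant to illustrate the definition.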