This repository consists of 3 tasks from the virtual internship. All datasets are properties of Forage pte ltd and they reserve the final rights.
In this task, I act as a junior data analyst in KPMG's Lighthouse and Innvation Team. I have coordinated with Sprocket Central Pty Ltd , a medium size bikes & cycling accessories organisation. Sprocket Central Pty Ltd requires help for its customer and transactions data.The organisation has a large dataset relating to its customers, but their team is unsure how to effectively analyse it to help optimise its marketing strategy.
In the spreadsheet provided, there are 3 datasets:
- Customer Demographic
- Customer Addresses
- Transactions data in the past 3 months
Upon viewing the spreadsheet, I did some prelimary data exploration and conducted some data cleaning.
i. For Customer Demographics, I removed missing, mislabelled, the column "default" and unmatched datasets.
ii. For Transactions, I also removed missing, mislabelled and data in the column "product_id" as the value of "0" is mislabelled. I have converted all incorrect data types from datetime to strings
iii. For NewCustomerList, I also removed missing, mislabeled, duplicated data. Afterwards, I replaced empty headers with "unknown headers 1-5" as no data provided for the column header. I then converted all incorrect data types from datetime to strings.
In this task, Sprocket Central Pty Ltd has given us a new list of 1000 potential customers with their demographics and attributes but with no prior transaction history with the organisation. The marketing team believes that the data would reveal useful customer insights which could help optimise resource allocation for targeted marketing. Hence, improve performance by focusing on high value customers.
Gathering the insights is divided into 3 steps,
I used the library "featuretools" to generate parent-child relationships for the column "customer id" in "CustomerDemographic" spreadsheet as the "parent" and the column "customer id" in "Transactions" spreadsheet as the "child".
Using "featuretools" library, I was able to see the types of feature; aggregation and transformative primitives
Aggrevate Primitives
Transformative Primitives
Afterwards I used the features "mean", "max", "min", "std", "skew","count","n_most_common" as the aggregation primitives and "time_since_previous" as the transformative primitives and form the following dataframe.
After forming the dataframe using feature engineering, I cleaned it by:
-
Remove all N_MOST_COMMON[1] & N_MOST_COMMON[1]
-
For Column Mean and above
- Numeric column- Fill in 0
- Caterorical- Fill in "U"
*First 6 rows of the edited dataframe
Using "Tensorflow" and "Keras" library, I used the dataframe from the feature engineering previously generated as the training set. The test set given by the client, shown in the below screenshot, will act as the test set.
Afterwards, I created a Sequential Model with:
- 2 densely connected hidden layers (nn size =64)
- Activation 'relu'
- input_shape = [4] due to 4 numeric columns
- "past_3_years_bike_related_purchases"
- "Age"
- "tenure"
- "property_valuation"
I seperated the target value or "label" from the features. This label is the value that you will train the model to predict.
In the screenshot below, I have used the training column "MEAN_Transactions_standard_cost" for the prediction of the Sequential Model
I proceeded to use EPOCH = 1000 to train the model.
Using tensorflow, I used the plotter to show the mean absolute error (MAE) vs the number of epochs
I then used the plotter to show the mean square error (MSE) vs the number of epochs
I experimented with "keras" early stopping to see if there are any improvements in the predictions.
After repeating the process of training and testing for the rest of the columns. I have obtained the following dataframe
I gathered all unique values and seperate them by interval ranges using groupby dataframe.This is an example for the column "past_3_years_bike_related_purchase
I then groupy each of the predicted columns based on their "sum" and "mean" and obtained this final dataframe as shown
Finally, I repeated the process for the different customer demographics for the columns "Gender_aggregate", "job_industry_category_aggregate", "wealth_segment_aggregate", "state_aggregate", "3_years_bike_purchases_agg", "Age_agg", "tenure_agg", "property_valuation_agg" and exported the dataframes as excel sheet
This is a screenshot of the column "Gender_aggregate"
In the last task, after performing customer profiling in the dataset, I built an interactive Tableau dashboard highlighting key trends in datasets on customer preferences and propensity to the client and propose growth strategies to improve customer targeting.
I have formed 3 dashboards, mainly "Social Dashboard", "Economic Dashboard" and "Australia Dashboard"
The social dashboard plots a faceted graph of the columns "Category by Age", "Category by Gender" and "Category by Industry Category"
The economic dashboard plots a faceted graph of the columns "past_3_years_bike_related_purchases" , "wealth_segment", "tenure", "property_valuation".
The last dashboard shows the numerical value of each of the columns "MEAN_Transactions_standard_cost", "MEAN_Transactions_list_price", "MAX_Transactions_standard_cost", "MAX_Transactions_list_price", "MIN_Transactions_standard_cost","MIN_Transactions_list_price", "STD_Transactions_standard_cost", "STD_Transactions_list_price", "SKEW_Transactions_standard_cost" ,"SKEW_Transactions_list_price", "COUNT_Transactions" in the 3 regions of Australia