kirajano/akb

In this project, we tackle the Kaggle Rossmann challenge. The goal is to predict the Sales of a given store on a given day. Submissions are evaluated on the root mean square percentage error (RMSPE).
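
As a reference for the metric, here is a minimal sketch of RMSPE in Python. The function name `rmspe` is illustrative and not taken from the repo. Rows with zero true sales are skipped because the percentage error is undefined there, which is also how the Kaggle competition scores submissions.

```python
import numpy as np


def rmspe(y_true, y_pred):
    """Root mean square percentage error, skipping rows where y_true is 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0  # percentage error is undefined for zero sales
    pct_error = (y_true[mask] - y_pred[mask]) / y_true[mask]
    return float(np.sqrt(np.mean(pct_error ** 2)))
```

For example, predicting 110 and 180 for true sales of 100 and 200 is a 10% error on both rows, so the RMSPE is 0.1.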

The dataset consists of two CSV files, store.csv and train.csv, plus a holdout file used for evaluation.

Data Files:

train.csv holds the sales info per day for each store.
store.csv holds info about each store.
holdout.csv holds "unseen" data that the model is evaluated on.
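
A minimal sketch of how the per-day sales file and the per-store file could be combined with pandas. The helper name `load_data` and the join key `Store` are assumptions based on the public Kaggle dataset, not taken from main.py.

```python
import pandas as pd


def load_data(sales_source, store_source):
    """Read both CSVs and attach the per-store columns to every daily row.

    Both arguments can be file paths or file-like objects.
    """
    sales = pd.read_csv(sales_source)
    stores = pd.read_csv(store_source)
    # Left join keeps every daily record, even if a store has no metadata.
    # The shared "Store" column is an assumption about the file layout.
    return sales.merge(stores, on="Store", how="left")
```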

Script Files: The repo contains main.py, which runs the whole workflow from the first step to the last. The script can be run right after cloning since all data used is in the repo. By default, the hyperparameter tuning section is commented out because it takes a long time to complete. The script can be run on its own, and its last printout is the RMSPE for the predictions on the holdout set.

The function.py file contains utility functions that are called in main.py.

rossman_model.sav contains the pickled, hyperparameter-tuned model with the lowest RMSPE.
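
A minimal sketch for loading the saved model, assuming it was written with Python's standard `pickle` module (the README only says "pickled", so this is a guess). The helper name `load_model` is hypothetical.

```python
import pickle


def load_model(path="rossman_model.sav"):
    """Load a pickled model from disk (assumes the file was written by pickle)."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

Only unpickle files you trust, and use library versions compatible with the ones the model was saved with, since unpickling can otherwise fail or behave differently.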

The individual steps of training the model are as follows:

Exploring data: EDA and visualization
Cleaning data:
- drop data with no store
- drop data with no DayOfWeek
- drop data where the store is NOT open
- drop data where promo is NaN
- drop SchoolHoliday data where promo is NaN
- drop parameters that don't seem useful:
  ('CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval')
- drop all remaining rows with NaNs (approximately 3% of rows)
- convert columns to int where necessary
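
The cleaning steps above can be sketched with pandas roughly as follows. This is not the code from main.py: the function name `clean` and the column names `Open` and `Promo` are assumptions based on the public Kaggle dataset.

```python
import pandas as pd

# Columns the README lists as not useful.
UNUSED_COLUMNS = [
    "CompetitionOpenSinceMonth", "CompetitionOpenSinceYear",
    "Promo2SinceWeek", "Promo2SinceYear", "PromoInterval",
]


def clean(df):
    """Apply the cleaning steps to a merged sales/store DataFrame."""
    # Drop rows missing the store id, the weekday or the promo flag.
    df = df.dropna(subset=["Store", "DayOfWeek", "Promo"])
    # Keep only days on which the store was open ("Open" is an assumed name).
    df = df[df["Open"] == 1]
    # Drop the unused columns, ignoring any that are not present.
    df = df.drop(columns=UNUSED_COLUMNS, errors="ignore")
    # Drop whatever rows still contain NaNs.
    df = df.dropna().copy()
    # Float columns that only hold whole numbers become int.
    for col in df.select_dtypes(include="float").columns:
        if (df[col] % 1 == 0).all():
            df[col] = df[col].astype(int)
    return df.reset_index(drop=True)
```
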
Encoding:
- add Month as dummies
- add a feature for scaled CompetitionDistance
- convert DayOfWeek to dummies
- convert StateHoliday to dummies
- convert StoreType to dummies
- convert Assortment to dummies
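
A minimal sketch of the encoding steps, assuming a `Date` column from which the month is derived and min-max scaling for CompetitionDistance. The README does not say which scaling is used, so that choice, the new column name `CompetitionDistanceScaled` and the function name `encode` are all illustrative.

```python
import pandas as pd

DUMMY_COLUMNS = ["Month", "DayOfWeek", "StateHoliday", "StoreType", "Assortment"]


def encode(df):
    """Add Month and a scaled CompetitionDistance, then one-hot encode."""
    df = df.copy()
    # Month is derived from an assumed "Date" column.
    df["Month"] = pd.to_datetime(df["Date"]).dt.month
    # Min-max scaling to [0, 1]; assumes at least two distinct distances.
    dist = df["CompetitionDistance"]
    df["CompetitionDistanceScaled"] = (dist - dist.min()) / (dist.max() - dist.min())
    # get_dummies replaces each listed column with one column per category.
    return pd.get_dummies(df, columns=DUMMY_COLUMNS)
```
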
Looking at the correlations of Sales with the other parameters. Sales correlate significantly with:
- Customers
- DayOfWeek
- StateHoliday
- StoreType
- Scaled CompetitionDistance etc.
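
One way to produce such a ranking is to compute the Pearson correlation of every numeric column with Sales and sort by absolute value. This is a sketch under that assumption; the helper name `sales_correlations` is hypothetical and the repo may compute correlations differently.

```python
import pandas as pd


def sales_correlations(df, target="Sales"):
    """Correlation of each numeric/boolean column with the target, strongest first."""
    # Non-numeric columns are left out; dummies stored as bool are cast to float.
    numeric = df.select_dtypes(include=["number", "bool"]).astype(float)
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order]
```
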
Test / Train Split
Baseline Model and RandomForestRegression
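
A sketch of the split and the baseline comparison with scikit-learn. The README does not say what the baseline is, so a mean predictor (`DummyRegressor`) is assumed here; the 80/20 split, the forest size and the function names are illustrative as well.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def rmspe(y_true, y_pred):
    """Root mean square percentage error, skipping rows where y_true is 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    return float(np.sqrt(np.mean(((y_true[mask] - y_pred[mask]) / y_true[mask]) ** 2)))


def compare_to_baseline(X, y, seed=0):
    """Fit a mean baseline and a random forest, return test RMSPE per model."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    models = {
        "baseline_mean": DummyRegressor(strategy="mean"),  # assumed baseline
        "random_forest": RandomForestRegressor(n_estimators=50, random_state=seed),
    }
    return {
        name: rmspe(y_test, model.fit(X_train, y_train).predict(X_test))
        for name, model in models.items()
    }
```
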
Feature Selection and Engineering
Training models via Pipelines (RandomForestRegression, KNR and XGBoost Regressors)
Hyperparameter Tuning of the best model from the previous step
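
The last two steps can be sketched as follows: compare a few scikit-learn pipelines by cross-validated RMSPE, then grid-search the best one. The parameter grids, the fold count and all function names are assumptions for illustration, not the values used in main.py. An `XGBRegressor` from the xgboost package could be added to the dictionary in the same way; it is left out so the sketch depends only on scikit-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def rmspe(y_true, y_pred):
    """Root mean square percentage error, skipping rows where y_true is 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    return float(np.sqrt(np.mean(((y_true[mask] - y_pred[mask]) / y_true[mask]) ** 2)))


# Hypothetical search spaces; "model__" addresses the pipeline step named "model".
PARAM_GRIDS = {
    "random_forest": {"model__n_estimators": [25, 50], "model__max_depth": [3, None]},
    "knr": {"model__n_neighbors": [3, 5, 9]},
}


def build_pipelines(seed=0):
    """One pipeline per candidate model; KNR gets scaling because it is distance based."""
    return {
        "random_forest": Pipeline(
            [("model", RandomForestRegressor(n_estimators=50, random_state=seed))]
        ),
        "knr": Pipeline(
            [("scale", StandardScaler()), ("model", KNeighborsRegressor())]
        ),
    }


def compare_and_tune(X, y, cv=3):
    """Return CV RMSPE per pipeline, the best name, its tuned params and tuned RMSPE."""
    # greater_is_better=False makes scikit-learn negate RMSPE so that higher is better.
    scorer = make_scorer(rmspe, greater_is_better=False)
    pipelines = build_pipelines()
    cv_rmspe = {
        name: float(-cross_val_score(pipe, X, y, cv=cv, scoring=scorer).mean())
        for name, pipe in pipelines.items()
    }
    best = min(cv_rmspe, key=cv_rmspe.get)
    search = GridSearchCV(pipelines[best], PARAM_GRIDS[best], scoring=scorer, cv=cv)
    search.fit(X, y)
    return cv_rmspe, best, search.best_params_, float(-search.best_score_)
```

On the full dataset a grid search like this can take a long time, so a smaller grid or a randomized search may be more practical.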
