🏪💊 ROSSMANN STORE SALES PREDICTION 💊🏪

📖 Background

Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.

📌 Problem Statement

Forecasting the sales' income of the next 6 weeks.

Since the recent results's accuracy are quite noisy, our work here is to give an assertive prediction of the sales of each store, up to 6 weeks in advance. This task has been assined to the whole team of Data Scientists, who are given a historical database in order to generate the desired forecasting. To catch up all details of the request, the team had a business meeting with company's CFO, who explained the need to establish a budget to carry out a general repair in each store.

💾 Data Understanding

- Code Used:

Python Version : 3.8
Packages : Jupyter, Pandas, Numpy, Matplotlib, Seaborn, Scikit-Learn among others (please, check full list here)
Frontend API: Telegram Bot
Backend: Heroku

- Importing Dataset:

Kaggle: https://www.kaggle.com/competitions/rossmann-store-sales/data

- Data Dictionary

Variable	Descriptions
Id	An id that represents a (store, date) duple within the test set
Store	A unique id for each store
Sales	The turnover for any given day
Customers	The number of customers on a given day
Open	An indicator for whether the store was open: 0 = closed, 1 = open
Stateholiday	Indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. A = public holiday, b = easter holiday, c = christmas, 0 = none
Schoolholiday	Indicates if the (store, date) was affected by the closure of public schools
Storetype	Differentiates between 4 different store models: a, b, c, d
Assortment	Describes an assortment level: a = basic, b = extra, c = extended
Competitiondistance	Distance in meters to the nearest competitor store
Competitionopensince[month/year]	Gives the approximate year and month of the time the nearest competitor was opened
Promo	Indicates whether a store is running a promo on that day
Promo2	Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
Promo2since[year/week]	Describes the year and calendar week when the store started participating in promo2
Promointerval	Describes the consecutive intervals promo2 is started, naming the months the promotion is started anew. E.G. "Feb,may,aug,nov" means each round starts in february, may, august, november of any given year for that store

- Business Assumptions

Stores with sales equal 0 were discarded.
Days when stores were closed were discarded.
Stores missing information about "Competition Distance" will be set a value of '200000' distance.
Analysis will be made one week after last date record.

🧾 Evaluation Metric

We used many error metrics during the project. Main model metric for evaluation was the Root Mean Squared Error (RMSE). The RMSE is calculated as

where y_hat is the predicted value, y is the ground truth value, and n is the number of stores in the dataset.

🔬 Solution Approach

The approach used to solve this task was done by applying CRISP-DM¹ methodology, which was divided in the following parts:

Data Description: understanding of the status of the database and dealing with missing values properly. Basic statistics metrics furnish an overview of the data.
Feature Engineering: derivation of new attributes based on the original variables aiming to better describe the phenomenon that will be modeled, and to supply interesting attributes for the Exploratory Data Analysis.
Feature Filtering: filtering of records and selection of attributes that do not contain information for modeling or that do not match the scope of the business problem.
Exploratory Data Analysis (EDA): exploration of the data searching for insights and seeking to understand the impact of each variable on the upcoming machine learning modeling.
Data Preparation: preprocessing stage required prior to the machine learning modeling step.
Feature Selection: selection of the most significant attributes for training the model.
Machine Learning Modeling: implementation of a few algorithms appropriate to the task at hand. In this case, models befitting the regression assignment - i.e., forecasting a continuous value, namely sales.
Hyperparameter Fine Tuning: search for the best values for each of the parameters of the best performing model(s) selected from the previous step.
Statistical Error Analysis: conversion of the performance metrics of the Machine Learning model to a more tangible business result.
Production Deployment: deployment of the model in a cloud environment (Heroku), using Flask connected to our model in a pickle file.
Telegram Bot: deployment of Telegram Bot API, here used as our user receiver. Check out at "Deployment" section.

🕵🏽‍♂️ Exploratory Data Analysis & Main Insights

- Numerical Attributes Correlation

- Categorical Attributes Correlation

- Main Hypothesis Chosen

Here, the criterion used to choose the main hypotheses was in the sense of how shocking and impacting the result would be for the business team's beliefs.

Hypothesis 1 (H2 in notebook): Stores with closer competitors should sell less. False: Data showed us that, actually, they sell MORE.

Business team's belief revealed us their thoughts on lower sales while drugstores are closer to the competitors. This hypothesis proves the opposite The correlation analysis of "competition distance" and "sales" shows a small correlation, indicating that sales do not increase when competitors are closer.
Hypothesis 2 (H4 in notebook): Stores with longer active offers should sell more. False: Data showed us that, stores that kept products on sale for a long time performed worse than before.

Again, we shocked business team's belief and common sense that. Here are the visualizations.
Hypothesis 3: Stores should sell more after the 10th day of each month.
True: the average performance is better after 10 days of the month.

💻 Machine Learning Modeling & Evaluation

Cross Validation

Performance on 5 K-Fold CV.

Model Name	MAE CV	MAPE CV	RMSE CV
Random Forest Regressor	837.68 +/- 218.74	0.12 +/- 0.02	1256.08 +/- 320.36
XGBoost Regressor	1039.91 +/- 167.19	0.14 +/- 0.02	1478.26 +/- 258.52
Linear Regression	2081.73 +/- 295.63	0.3 +/- 0.02	2952.52 +/- 468.37
Linear Regression - Lasso	2116.38 +/- 341.50	0.29 +/- 0.01	3057.75 +/- 504.26

Although the Random Forest model has proven to be superior to the others, in some cases this model ends up requiring a lot of space to be published, resulting in an extra cost for the company to keep it running. Therefore, the chosen algorithm was the XGBoost Regressor which in sequence passed to the Hyperparameter Fine Tunning step.

Final Model (after Hyperparameter Fine-Tuning)

Model Name	Mean Absolute Error	Mean Absolute Percentage Error	Root Mean Squared Error
XGBoost Regressor	699.43	0.1037	1005.6039

📉 Business Performance

According to our forecasting model, we achieved an efficiency improvement of 48.37% compared to previous forecasts (Average Model had 1354.8 for MAE and our new model has 699.43). Translating into business terms, we calculate the sum of worst and best revenue scenarios, and the respective forecasts made.

Scenario	Values
Predictions	R$285,934,117.20
Worst Scenario	R$285,150,484.70
Best Scenario	R$286,717,749.69

As we can see, our best scenario and worst scenario only diverge from predictions by 0.27%: certainly more assertive than previous one.

💡 Conclusions

XGBoostRegressor model had the best performance when it comes to Results/Time * Accuracy, and thus gave us a more assertive prediction, helping our CFO on taking futures decisions about budget and repairing the stores.

👣 Next steps

DS team establish to start another cycle to analyze the problem, seeking different approaches, creating another hypothesis, reconsidering the ones not chosen, and reanalysing stores with behavior that were tough to do the forecast. Some approachs in mind to be made:

Collect more data;
Change aggregate parameter "sum" to "mean" of all stores for the assortment hypothesis. 'Extended assortment' will probably perform better than the others;
Refine the feature engineering, trying to find another good features;
Work with GridSearchCV, as we have more time to tune our model, since its already in production; and many more.

🚀 Deployment

Go say 'Hi!' to our bot! Check it out at:

Sign up in Telegram;
Submit one number at a time and wait for prediction!
or just look for 'rossmannleassis_bot' in Telegram's search!

🔗 References

Data Science Process Alliance - What is CRISP-DM
Owen Zhang - Open Source Tools & Data Science Competitions
Statstest - Cramer's V
Bulldogjob - How to write a Good readme

If you have any other suggestion or question, feel free to contact me via LinkedIn.

✍🏽 Author

Leandro Destefani

💪 How to contribute

Fork the project.
Create a new branch with your changes: git checkout -b my-feature
Save your changes and create a commit message telling you what you did: git commit -m" feature: My new feature "
Submit your changes: git push origin my-feature

leassis91 / rossmann_store