Image source: https://vc4a.com/ventures/sendy-limited/
Team 2 - Mangaliso Samuel Makhoba, Bryan Green, Michael Ilic, Lawrence Hlapa, Faatimah Mansoor

The structure of this notebook is as follows:
Sendy is a logistics company in Kenya. The aim of this project is to build a regression model for Sendy that can accurately predict delivery time, from the time a package is picked up to its arrival at the final destination.
To build this model, the Train dataset and the Riders dataset will be used. Several regression models will be trained, and the most suitable one will be selected. That model will then be used to predict delivery time for the Test dataset.
Modules to be imported in the Jupyter notebook:

    # import modules
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    %matplotlib inline
    import seaborn as sns
    import plotly.express as px
    import math
- Train - Dataset for training our model
    train = pd.read_csv('/content/Train.csv')
- Test - Dataset for testing our model (y variables to be predicted)
    test = pd.read_csv('/content/Test.csv')
- Riders - Riders info
    riders = pd.read_csv('/content/Riders.csv')
- Variable Definitions - Info about columns and their values
    variable_def = pd.read_csv('/content/VariableDefinitions.csv')
- Sample - Sample results of competition
    sample = pd.read_csv('/content/SampleSubmission.csv')
NB: We used Jupyter Notebook to run the import code.
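Since both the Train and Riders datasets feed the model, they have to be combined at some point. The snippet below is a minimal sketch of one way to do that with a left join. The key column name `Rider Id`, the other column names, and all values are assumptions made for illustration; they are not taken from the files listed above.

```python
import pandas as pd

# Toy stand-ins for Train.csv and Riders.csv; all values are made up.
train = pd.DataFrame({
    "Order No": ["Order_1", "Order_2", "Order_3"],
    "Rider Id": ["Rider_A", "Rider_B", "Rider_A"],  # assumed join key
    "Time from Pickup to Arrival": [745, 1993, 455],
})
riders = pd.DataFrame({
    "Rider Id": ["Rider_A", "Rider_B"],
    "No_Of_Orders": [1637, 396],  # assumed rider-level feature
})

# A left join keeps every order, even if a rider is missing from riders.
merged = train.merge(riders, on="Rider Id", how="left")
print(merged.shape)
```

With the real files, the same call would attach one row of rider information to every order, provided the key column has the same name in both frames.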
- Vehicle type (although there is only one - "bike")
- Platform type
- Personal or Business, referring to the business type
- Date and day - however they should be represented as cyclical
- Temperature
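The note that date and day should be represented as cyclical can be illustrated with a sine/cosine encoding. This is a sketch of a common technique, not necessarily the one used in the notebook; the `weekday` column and the `encode_cyclical` helper are hypothetical.

```python
import numpy as np
import pandas as pd

def encode_cyclical(df, column, period):
    """Add sine and cosine columns so the end of a cycle sits next to its start."""
    angle = 2 * np.pi * df[column] / period
    out = df.copy()
    out[column + "_sin"] = np.sin(angle)
    out[column + "_cos"] = np.cos(angle)
    return out

# Hypothetical weekday column, 1 = Monday ... 7 = Sunday.
days = pd.DataFrame({"weekday": [1, 4, 7]})
encoded = encode_cyclical(days, "weekday", period=7)
print(encoded)
```

In the encoded space Sunday (7) ends up closer to Monday (1) than to Thursday (4), which a plain integer column cannot express.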
- Figure 1. Number of missing values: null values per variable in the 'train' dataset.
- Figure 2. Temperature distribution: temperature distribution from the 'train' dataset.
- Figure 3. Overall data distribution: distribution of each of the variables in the 'train' dataset, including the y variable (Time from pickup to arrival).
- Figure 4. Checking correlations: correlations heat map for the variables in the 'train' dataset.
- Figure 5. Time from pickup to arrival: graphical representation of the distribution of pickup location.
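The numbers behind plots like Figure 1 (missing values) and Figure 4 (correlations) can be computed with pandas alone. The sketch below uses a toy frame; the column names and values are invented for illustration and the plotting step is left out.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names are illustrative only.
df = pd.DataFrame({
    "Temperature": [20.4, np.nan, 24.1, np.nan],
    "Distance (KM)": [4, 16, 3, 9],
    "Time from Pickup to Arrival": [745, 1993, 455, 1341],
})

# Missing values per column, largest first (input for a bar chart).
null_counts = df.isnull().sum().sort_values(ascending=False)

# Pairwise Pearson correlations (input for a heat map).
corr = df.corr()

print(null_counts)
print(corr.round(2))
```

Passing `corr` to `sns.heatmap(corr, annot=True)` would then draw the heat map itself.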
We started by analyzing the data and finding the columns that had too many missing values, or had no relevance to the final prediction.
The next step was to split the data into train and test sets, so that overfitting could be detected and the effectiveness of the model validated on data it had not seen.
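A hold-out split of this kind can be sketched with scikit-learn's `train_test_split`. The data here is synthetic, and the 80/20 ratio and `random_state` are assumptions rather than the notebook's settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # synthetic features
y = rng.normal(size=100)       # synthetic target

# Hold out 20% of the rows; a fixed random_state makes the split repeatable.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_val.shape)
```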
We built a custom cleaner function to drop nulls and format the data in a way that made it usable for model training.
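The cleaner itself is not shown here, so the following is only a hypothetical sketch of what such a function might look like: it drops sparse columns, drops rows that still contain nulls, and one-hot encodes text columns. The function name `clean`, the 50% threshold, and the sample columns are all assumptions.

```python
import numpy as np
import pandas as pd

def clean(df, max_null_fraction=0.5):
    """Hypothetical cleaner: drop sparse columns, drop null rows, encode text."""
    out = df.copy()
    # Drop columns where more than max_null_fraction of the values are missing.
    sparse = out.columns[out.isnull().mean() > max_null_fraction]
    out = out.drop(columns=sparse)
    # Drop any rows that still contain nulls.
    out = out.dropna()
    # One-hot encode text columns so every feature is numeric.
    out = pd.get_dummies(out)
    return out.reset_index(drop=True)

# Toy input with made-up values.
raw = pd.DataFrame({
    "Precipitation": [np.nan, np.nan, np.nan, 1.2],
    "Temperature": [20.4, 24.1, np.nan, 19.0],
    "Personal or Business": ["Business", "Personal", "Business", "Business"],
})
cleaned = clean(raw)
print(cleaned)
```

Keeping the steps in one function means the same transformation can be applied to both the training data and the data to be predicted.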
We applied cross-validation to check that the cleaned data gave consistent scores across folds and was ready for training.
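A cross-validation check along these lines can be sketched with scikit-learn's `cross_val_score`. The data is synthetic, and the choice of five folds and a linear model is an assumption made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # synthetic features
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# scikit-learn reports negated RMSE so that higher is always better.
scores = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_root_mean_squared_error"
)
rmse_per_fold = -scores
print(rmse_per_fold.round(3), rmse_per_fold.mean().round(3))
```

Similar RMSE values across the folds suggest the score does not depend on which rows happen to land in the training portion.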
- Linear, Lasso and Ridge Regression
- Decision Tree Regressor
- Random Forest, Gradient Boosting, Bagging, AdaBoost Regressors
- XGB, LGBM, CatBoost Regressors
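Comparing several regressors on the same validation set can be sketched as a loop over a dictionary of models. This example uses synthetic data and only a few scikit-learn models with assumed hyperparameters; it is an illustration of the approach, not the notebook's actual training code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))  # synthetic features
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.2, size=300)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "DecisionTree": DecisionTreeRegressor(random_state=0),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each model and record its RMSE on the held-out rows.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_val)
    results[name] = float(np.sqrt(mean_squared_error(y_val, preds)))

# Lower RMSE is better, so the best model is the one with the minimum.
best = min(results, key=results.get)
print(results, best)
```

The XGBoost, LightGBM and CatBoost regressors follow the same `fit`/`predict` pattern, so they could be added to the dictionary in the same way.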
The moment we’ve all been waiting for!
We trained 6 different regression models and found that CatBoostRegressor returned the best (lowest) RMSE, with a value of 720.82.
- RMSE was used to select the best model
- CatBoostRegressor provided the best (lowest) RMSE score of 720.82
- We found that temperature had no effect on the delivery time
After testing 6 different models, CatBoostRegressor proved to be the most effective at accurately predicting delivery time.