anova-test chi-square-test decision-tree-regression eli5 folium geopandas lightgbm nyc-taxi-dataset pandas python regression sklearn stacking-ensemble xgboost

NYC Taxi Trip Duration Prediction

Table of Contents

About the Project
Dataset Description
Feature Engineering and Data Pre-processing
Model Implementation
Model Evaluation and Results
Conclusion
Libraries Used
Contact

About the Project

The NYC taxi trip duration dataset is a dataset released by the NYC Taxi and Limousine Commission, which includes several features like pickup time, dropoff time, pickup coordinates etc as possible predictors for prediction of taxi trip duration. The aim of this project is to accurately predict the trip duration using a regression model, using (but not limited to) the above features. This project also focuses on selecting the best combination of input features for accurate prediction, by combining both logical reasoning and iterations.

(back to top)

Dataset Description

The dataset consists of various information as features in relation to a given taxi trip. They are:

id - a unique identifier for each trip
vendor_id - a code indicating the provider associated with the trip record
pickup_datetime - date and time when the meter was engaged
dropoff_datetime - date and time when the meter was disengaged
passenger_count - the number of passengers in the vehicle (driver entered value)
pickup_longitude - the longitude where the meter was engaged
pickup_latitude - the latitude where the meter was engaged
dropoff_longitude - the longitude where the meter was disengaged
dropoff_latitude - the latitude where the meter was disengaged
store_and_fwd_flag - This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server - Y=store and forward; N=not a store and forward trip

Target variable to predict:

trip_duration - duration of the trip in seconds

(back to top)

Feature Engineering and Data Pre-processing

Initially, the pickup and dropoff datetime features are converted into more meaningful variables such as the day of the week, day of the month, month of the year, and hour of the day. This allows for a better understanding of how the trip duration varies across different time periods, and also create new features which can establish temporal (daily, weekly and monthly) relationships with the trip duration.
Additionally, outliers of trip duration are handled by trimming extreme values and filtering pickup and dropoff coordinates within (and the near proximity of) NYC boundaries.
To select the combination of the most relevant features, separate datasets are created by considering different combinations of a few features of interest - namely, passenger count, store_and_fwd_flag, and holidays. Each of these datasets are then fit into regression models as separate iterations, which allows for a comparison of their impact on the predicting power of the models.
Further, the continuous variables are transformed using the appropriate transformations to ensure normal distribution of the residues. The features are scaled by applying standard scaling to ensure consistent scaling across features.

(back to top)

Model Implementation

The scaled dataset is split into train and test dataset based on an appropriate test ratio. Seven different regression models are then implemented onto this dataset roughly in the order of increasing complexity, namely

Linear Regression
Lasso regularized linear model
Ridge regularized linear model
Polynomial regression
Light gradient-boosting machine
Decision Trees
XGBoost

The 2 best models out of these are then stacked together to test for further improvement.

(back to top)

Model Evaluation and Results

The combination of RMSE, R² and adjusted R² are chosen as the appropriate metrics for evaluation of the regression models. The values of these metrics along with the model runtime (in seconds) for the regression from one particular iteration of input dataset were:

Model	Train RMSE	Train R²	Train adj_R²	Test RMSE	Test R²	Test adj_R²	Runtime (s)
Linear Regression	28.321850	0.581448	0.581444	28.311827	0.583317	0.583300	0.427053
L1 regularized LR	28.321850	0.581448	0.581444	28.311829	0.583317	0.583300	6.543669
L2 regularized LR	28.321850	0.581448	0.581444	28.311834	0.583317	0.583300	3.810947
Polynomial Regression	23.418874	0.653906	0.653882	23.406192	0.655516	0.655419	5.429366
Decision Trees	13.972283	0.793512	0.793510	18.102460	0.733574	0.733564	246.243221
LightGBM	12.051112	0.821904	0.821902	13.053452	0.807884	0.807876	379.578703
XGBoost	5.952393	0.912033	0.912032	12.687642	0.813268	0.813260	1634.325324
Stacking	22.766780	0.663543	0.663540	25.829009	0.619858	0.619843	1986.644313

On the basis of Test RMSE and Test R² scores, the XGBoost edges out the LightGBM in performance, which has the maximum R² and minimum RMSE. While the LightGBM had much lower model training time, it gave out feature importances not in accordance with the other models. Hence, the XGBoost is chosen as the best regression model.

(back to top)

Conclusion

Overall, the XGBoost proved to be the most productive model for prediction of taxi trip durations. distance was adjudged to be the most important predictor, while vendor_id was concluded to be the least important one.than the other models in this context. Local explanation of the XGBoost using ELI5 also provided reasonable results.
On the iteration of input datasets by combination of certain relevant features - the dataset which dropped passenger_count and store_and_fwd_flag (due to their high class imbalance and feature prediction redundancy with respect to trip_duration) and included holiday as a predictor proved to be best input for the regression models, producing the best test metric scores.

(back to top)

Libraries Used

For handling and manipulating data

For Visualisation

For Hypothesis testing, Pre-processing and Model training

(back to top)

Contact

(back to top)

About

This project aims to predict the Taxi-trip duration within NYC based on several factors as predictors. Various combinations of relevant features are explored as iterations. After analysing the dataset, important and necessary features are selected. Several regression models are implemented & evaluated based on R2 & RMSE, & predictions visualised

anova-test chi-square-test decision-tree-regression eli5 folium geopandas lightgbm nyc-taxi-dataset pandas python regression sklearn stacking-ensemble xgboost

Languages

Language:Jupyter Notebook 100.0%