vahadruya / Capstone_Regression_NYC_Taxi_Trip_Duration_Prediction

This project aims to predict taxi trip duration within NYC from several predictor features. Various combinations of relevant features are explored as iterations. After analysing the dataset, the important and necessary features are selected. Several regression models are implemented and evaluated on R2 and RMSE, and their predictions are visualised.

NYC Taxi Trip Duration Prediction

Table of Contents
  1. About the Project
  2. Dataset Description
  3. Feature Engineering and Data Pre-processing
  4. Model Implementation
  5. Model Evaluation and Results
  6. Conclusion
  7. Libraries Used
  8. Contact

About the Project

The NYC taxi trip duration dataset, released by the NYC Taxi and Limousine Commission, includes features such as pickup time, dropoff time, and pickup coordinates that can serve as predictors of taxi trip duration. The aim of this project is to predict trip duration accurately with a regression model, using (but not limited to) these features. The project also focuses on selecting the best combination of input features by combining logical reasoning with iterative experiments.

Dataset Description

Each record in the dataset describes a single taxi trip through the following features:

  • id - a unique identifier for each trip
  • vendor_id - a code indicating the provider associated with the trip record
  • pickup_datetime - date and time when the meter was engaged
  • dropoff_datetime - date and time when the meter was disengaged
  • passenger_count - the number of passengers in the vehicle (driver entered value)
  • pickup_longitude - the longitude where the meter was engaged
  • pickup_latitude - the latitude where the meter was engaged
  • dropoff_longitude - the longitude where the meter was disengaged
  • dropoff_latitude - the latitude where the meter was disengaged
  • store_and_fwd_flag - indicates whether the trip record was held in vehicle memory before being sent to the vendor because the vehicle had no connection to the server (Y = store and forward; N = not a store and forward trip)

Target variable to predict:

  • trip_duration - duration of the trip in seconds

Feature Engineering and Data Pre-processing

  • Initially, the pickup and dropoff datetime features are converted into more meaningful variables such as the day of the week, day of the month, month of the year, and hour of the day. This shows how trip duration varies across time periods and creates new features that capture temporal (daily, weekly and monthly) relationships with the trip duration.
  • Outliers are then handled in two ways: extreme trip duration values are trimmed, and trips are kept only if their pickup and dropoff coordinates fall within (or in the near proximity of) the NYC boundaries.
  • To select the most relevant combination of features, separate datasets are created from different combinations of a few features of interest, namely passenger count, store_and_fwd_flag, and holidays. Each of these datasets is then fitted to the regression models as a separate iteration, which allows their impact on the predictive power of the models to be compared.
  • Finally, the continuous variables are transformed so that the model residuals are approximately normally distributed, and all features are standardized (standard scaling) to put them on a common scale.
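
The first two steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the column names come from the dataset description, but the derived column names, the bounding box in NYC_BOUNDS, and the duration limits are all assumptions chosen for the example. For brevity only pickup_datetime is expanded.

```python
import pandas as pd

# Hypothetical bounding box roughly enclosing NYC; the project's real limits are not stated.
NYC_BOUNDS = {"lon_min": -74.3, "lon_max": -73.7, "lat_min": 40.5, "lat_max": 41.0}


def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive calendar features from pickup_datetime (derived names are illustrative)."""
    out = df.copy()
    pickup = pd.to_datetime(out["pickup_datetime"])
    out["pickup_hour"] = pickup.dt.hour
    out["pickup_weekday"] = pickup.dt.dayofweek  # Monday = 0, Sunday = 6
    out["pickup_day"] = pickup.dt.day
    out["pickup_month"] = pickup.dt.month
    return out


def filter_outliers(df: pd.DataFrame, min_seconds: int = 60,
                    max_seconds: int = 3 * 3600, bounds: dict = NYC_BOUNDS) -> pd.DataFrame:
    """Keep trips with a plausible duration whose endpoints lie inside the bounding box.

    The duration limits are assumed values, not the ones used in the project.
    """
    ok_duration = df["trip_duration"].between(min_seconds, max_seconds)
    ok_pickup = (df["pickup_longitude"].between(bounds["lon_min"], bounds["lon_max"])
                 & df["pickup_latitude"].between(bounds["lat_min"], bounds["lat_max"]))
    ok_dropoff = (df["dropoff_longitude"].between(bounds["lon_min"], bounds["lon_max"])
                  & df["dropoff_latitude"].between(bounds["lat_min"], bounds["lat_max"]))
    return df[ok_duration & ok_pickup & ok_dropoff].reset_index(drop=True)


# Tiny made-up sample: one valid trip, one overly long trip, one pickup far outside NYC.
sample = pd.DataFrame({
    "pickup_datetime": ["2016-03-14 17:24:55", "2016-06-12 00:43:35", "2016-01-19 11:35:24"],
    "pickup_longitude": [-73.98, -73.98, -120.00],
    "pickup_latitude": [40.76, 40.73, 40.76],
    "dropoff_longitude": [-73.96, -73.99, -73.97],
    "dropoff_latitude": [40.76, 40.73, 40.71],
    "trip_duration": [455, 86000, 2124],
})
clean = filter_outliers(add_time_features(sample))
```

Only the first sample row survives the filter; the other two are dropped for duration and location respectively.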

Model Implementation

The scaled dataset is split into training and test sets using an appropriate test ratio. Seven regression models are then fitted to this dataset, roughly in order of increasing complexity:

  1. Linear Regression
  2. Lasso regularized linear model
  3. Ridge regularized linear model
  4. Polynomial regression
  5. Light gradient-boosting machine
  6. Decision Trees
  7. XGBoost

The two best of these models are then stacked to test for further improvement.
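
A minimal sketch of this workflow using scikit-learn alone is shown below. It runs on synthetic data from make_regression rather than the taxi dataset, omits LightGBM and XGBoost to stay self-contained, and stacks the polynomial and decision-tree models purely as stand-ins for "the two best models". The hyperparameters, the 80/20 split, and the Ridge final estimator are assumptions for illustration. Note that this sketch fits the scaler on the training split only, which differs from scaling before the split as described above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the engineered taxi features (assumption, not project data).
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Models in roughly increasing order of complexity; hyperparameters are illustrative.
models = {
    "Linear Regression": LinearRegression(),
    "L1 regularized LR": Lasso(alpha=0.1),
    "L2 regularized LR": Ridge(alpha=1.0),
    "Polynomial Regression": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    "Decision Trees": DecisionTreeRegressor(max_depth=6, random_state=0),
}

# Stack two of the models; StackingRegressor clones its base estimators before fitting.
models["Stacking"] = StackingRegressor(
    estimators=[("poly", models["Polynomial Regression"]), ("tree", models["Decision Trees"])],
    final_estimator=Ridge(),
)

# Fit each model behind a StandardScaler and record its R2 on the held-out test set.
test_r2 = {}
for name, model in models.items():
    pipeline = make_pipeline(StandardScaler(), model)
    pipeline.fit(X_train, y_train)
    test_r2[name] = pipeline.score(X_test, y_test)
```

Gradient-boosted models such as LightGBM or XGBoost could be added to the same dictionary through their scikit-learn compatible regressors, assuming those packages are installed.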

Model Evaluation and Results

RMSE, R2 and adjusted R2 are used together as the evaluation metrics for the regression models. For one particular iteration of the input dataset, the values of these metrics, along with each model's runtime in seconds, were:

Model                  Train RMSE  Train R2  Train adj_R2  Test RMSE  Test R2   Test adj_R2  Runtime (s)
Linear Regression      28.321850   0.581448  0.581444      28.311827  0.583317  0.583300     0.427053
L1 regularized LR      28.321850   0.581448  0.581444      28.311829  0.583317  0.583300     6.543669
L2 regularized LR      28.321850   0.581448  0.581444      28.311834  0.583317  0.583300     3.810947
Polynomial Regression  23.418874   0.653906  0.653882      23.406192  0.655516  0.655419     5.429366
Decision Trees         13.972283   0.793512  0.793510      18.102460  0.733574  0.733564     246.243221
LightGBM               12.051112   0.821904  0.821902      13.053452  0.807884  0.807876     379.578703
XGBoost                5.952393    0.912033  0.912032      12.687642  0.813268  0.813260     1634.325324
Stacking               22.766780   0.663543  0.663540      25.829009  0.619858  0.619843     1986.644313

On the basis of test RMSE and test R2, XGBoost narrowly outperforms LightGBM (RMSE 12.69 vs 13.05, R2 0.813 vs 0.808), giving it the lowest test RMSE and highest test R2 of all the models. LightGBM trained much faster, but its feature importances did not agree with those of the other models. The stacked model did not improve on either: its test RMSE (25.83) and test R2 (0.62) are worse than those of both XGBoost and LightGBM. Hence, XGBoost is chosen as the best regression model.
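
For reference, the three metrics can be computed as in the sketch below. It uses the standard definitions, with adjusted R2 = 1 - (1 - R2)(n - 1)/(n - p - 1) for n samples and p features; the function name regression_metrics and its signature are hypothetical and not taken from the project.

```python
import numpy as np


def regression_metrics(y_true, y_pred, n_features: int) -> dict:
    """Return RMSE, R2 and adjusted R2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    residual_ss = np.sum((y_true - y_pred) ** 2)       # unexplained variation
    total_ss = np.sum((y_true - y_true.mean()) ** 2)   # variation around the mean
    rmse = np.sqrt(residual_ss / n)
    r2 = 1.0 - residual_ss / total_ss
    # Adjusted R2 penalises R2 for the number of predictors used.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {"rmse": rmse, "r2": r2, "adj_r2": adj_r2}


# Made-up values purely to show the call.
metrics = regression_metrics([1, 2, 3, 4, 5], [1, 2, 3, 4, 6], n_features=1)
```

Because the penalty term shrinks as n grows, R2 and adjusted R2 are nearly identical on a large dataset, which is consistent with the small gaps between the two columns in the table above.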

Conclusion

  • Overall, XGBoost proved to be the best-performing model for predicting taxi trip durations in this context. The distance feature was judged the most important predictor, while vendor_id was the least important. Local explanation of the XGBoost model using ELI5 also gave reasonable results.
  • Among the input datasets built from combinations of the features of interest, the best input for the regression models was the one that dropped passenger_count and store_and_fwd_flag (because of their high class imbalance and their redundancy as predictors of trip_duration) and included holiday as a predictor. It produced the best test metric scores.
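
This README does not say how the distance predictor is derived. One common choice for pickup and dropoff coordinates is the great-circle (haversine) distance, sketched below as an assumption rather than the project's confirmed method; the function name haversine_km and the mean Earth radius of 6371 km are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Example call with made-up pickup and dropoff coordinates inside Manhattan.
trip_km = haversine_km(40.7679, -73.9822, 40.7656, -73.9646)
```

Haversine distance ignores the street grid, so it understates the distance actually driven; it is still a cheap and widely used proxy for trip length.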

Libraries Used

For handling and manipulating data

Pandas, NumPy

For Visualisation

Matplotlib, Seaborn, IPython, Graphviz, GeoPandas, folium

For Hypothesis testing, Pre-processing and Model training

Statsmodels, SciPy, scikit-learn, ELI5

Contact

Linkedin Gmail

Languages

Language:Jupyter Notebook 100.0%