PamelaKelly / ETADataHandling

Data Cleaning and Modelling for ETA Dublin Bus Travel Time Predictor. Uses linear regression to create a predictive model.

ETA_Data_Handling

Data Cleaning and Modelling for ETA Dublin Bus Project

Github Repository: https://github.com/PamelaKelly/ETA_Data_Handling

Please Note: Datasets are not included. If you would like to run any of the scripts locally please contact us for the datasets.

Guide to Folders:

  1. Data Cleaning: Consolidation of all data cleaning processes into one place.

Core scripts used:

- Data_Cleaning_2012_Week1.ipynb - final data cleaning file for the first week of data.
- Data_Cleaning_Process_ETA.ipynb - final file for cleaning the full data set.
- data_cleaning.py - deprecated; it lags behind the Jupyter notebooks.

Scripts/

- data_cleaning.py - an object-oriented version of the data cleaning process; not used for the final process.
- data_quality.py - checks that null values have been removed.
- feature_engineering.py - an object-oriented version of the feature engineering process; not used for the final process.
- test_data_cleaning.py - used for debugging during the data cleaning process.
- test_feature_engineering.py - used for debugging during the feature engineering process.
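The kind of check data_quality.py performs can be illustrated with a minimal sketch. This is not the repository's code: the function name `count_missing` and the column names are hypothetical, and it assumes rows are available as plain dictionaries rather than in whatever structure the notebooks use.

```python
import math
from typing import Dict, Iterable, Mapping


def is_missing(value: object) -> bool:
    """Treat None, empty strings and float NaN as missing."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)


def count_missing(rows: Iterable[Mapping[str, object]]) -> Dict[str, int]:
    """Count missing values per column across all rows."""
    missing: Dict[str, int] = {}
    for row in rows:
        for column, value in row.items():
            missing.setdefault(column, 0)
            if is_missing(value):
                missing[column] += 1
    return missing


# Hypothetical cleaned rows; the real column names are not given in this README.
cleaned = [
    {"journey_pattern_id": "00010001", "stop_id": 226, "delay": 0},
    {"journey_pattern_id": "00010001", "stop_id": 228, "delay": -12},
]
report = count_missing(cleaned)
assert all(count == 0 for count in report.values()), report
```

A value of 0 is deliberately not treated as missing, since a zero delay is a valid observation.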

stop_lookup/

- stop_lookup.py - implements nearest-neighbour search for stops.
- __route_tree_builder.py - builds the trees that represent route information for quick searching.
- test_stop_lookup.py - tests the stop_lookup module.
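To show what a nearest-neighbour stop lookup does, here is a minimal sketch. It is an assumption-laden illustration, not the stop_lookup module itself: it uses a brute-force scan with the haversine distance instead of the tree structures that __route_tree_builder.py builds, and the stop ids and coordinates are made up.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_stop(lat, lon, stops):
    """Return (stop_id, distance_m) for the stop closest to (lat, lon).

    `stops` maps a stop id to a (lat, lon) tuple. This is a linear scan,
    so each query costs O(number of stops).
    """
    if not stops:
        raise ValueError("stops must not be empty")
    best_id = min(stops, key=lambda s: haversine_m(lat, lon, *stops[s]))
    return best_id, haversine_m(lat, lon, *stops[best_id])


# Hypothetical stop ids with illustrative coordinates in central Dublin.
stops = {
    "stop_a": (53.3498, -6.2603),
    "stop_b": (53.3438, -6.2546),
    "stop_c": (53.3382, -6.2591),
}
stop_id, distance_m = nearest_stop(53.3440, -6.2550, stops)
print(stop_id, round(distance_m))
```

A linear scan is easy to verify but slow for large datasets; a spatial tree trades build time for much faster repeated queries, which is presumably why the module builds trees up front.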

  2. Data Modelling

- Modelling_all_data.ipynb - the final modelling process for all of the data.
- Modelling_Every_Journey_Pattern_ID.ipynb - modelling the individual journey patterns.
- Modelling_For_One_Week_Data.ipynb - testing a number of models (SVM, ANN, RFR, LR) on a subset of the first week of data.
- Modelling_LinearRegression_RandomForestRegressor.ipynb - testing and comparing these two models on the first week of data.
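Since the project's predictive model is a linear regression, a small worked example may help. The sketch below fits ordinary least squares with a single feature using the closed-form solution. It is only an illustration: the notebooks' actual features, library, and data are not described here, so the feature (distance in km), the target (travel time in seconds), and the numbers are all invented.

```python
def fit_simple_linear_regression(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predict the target for a single feature value."""
    return intercept + slope * x


# Synthetic, exactly linear data (not from the Dublin Bus dataset):
# travel_time = 20 + 180 * distance.
distance_km = [1.0, 2.0, 3.0, 4.0]
travel_time_s = [200.0, 380.0, 560.0, 740.0]

slope, intercept = fit_simple_linear_regression(distance_km, travel_time_s)
print(slope, intercept, predict(5.0, slope, intercept))
```

With real journey data the fit would not be exact, and a held-out error metric would be needed to compare it against the other models tested in the notebooks.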

  3. Data Quality

- data_quality_one_week_v1.ipynb - first iteration of data quality on one week of data (first week of November).
- data_quality_report_one_week_v2.ipynb - second iteration of data quality on one week of data (first week of November).
- data_quality_report3_all_data.ipynb - third iteration of data quality, on the full dataset.

  4. Database Scripts

- database_manager.py - handles setting up the database details and connections; abstracted from the specifics of the project.
- database_setup.py - handles creating the schema and populating the database specifically for this project.
- queries.py - tests ORM queries for the front end, to help optimize them.

tests/

- test_database_manager.py - tests the database_manager script.
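The split between a generic database_manager.py and a project-specific database_setup.py can be sketched as follows. This is a hypothetical illustration only: it uses the standard-library `sqlite3` module with an in-memory database, whereas the project's actual database engine and ORM are not named in this README, and the `DatabaseManager` class, `create_schema` function, and `stops` table are invented for the example.

```python
import sqlite3
from contextlib import closing


class DatabaseManager:
    """Generic connection handling with no knowledge of the project schema."""

    def __init__(self, path=":memory:"):
        self._connection = sqlite3.connect(path)

    def execute(self, sql, params=()):
        """Run one statement in a transaction and return any fetched rows."""
        with self._connection:  # commits on success, rolls back on error
            with closing(self._connection.cursor()) as cursor:
                cursor.execute(sql, params)
                return cursor.fetchall()

    def close(self):
        self._connection.close()


def create_schema(db):
    """Project-specific setup; this table and its columns are hypothetical."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS stops ("
        "stop_id INTEGER PRIMARY KEY, name TEXT NOT NULL, lat REAL, lon REAL)"
    )


db = DatabaseManager()
create_schema(db)
db.execute("INSERT INTO stops VALUES (?, ?, ?, ?)", (1, "Example Stop", 53.35, -6.26))
rows = db.execute("SELECT name FROM stops WHERE stop_id = ?", (1,))
db.close()
print(rows)
```

Keeping schema creation out of the manager class means the connection logic can be tested on its own, which matches the separate test_database_manager.py listed above.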

  5. Static Data Prep

- stop_lookup/ - helper folder to enable nearest-neighbour lookups; duplicated here due to import issues.
- 2012_stops.ipynb - parses and preps the 2012 stop information, including conversion of location coordinates.
- Get_Timetable.ipynb - extracts timetables from raw data.
- map_timetables_journey_patterns.py - attempts to map timetables scraped from the web archive to journey patterns; unsuccessful.
- nearest_stop.py - adds the nearest stop to the dataset to aid routes.py.
- routes.py - extracts route information from raw data.
- stops.py - parses 2017 stop data.
- timetable_scraper - scraped 2012 timetables from the web archive.


Languages

Jupyter Notebook 96.9%, Python 3.1%