Bus-Capstone

Capstone project for NYC Department of Transportation.

Important documentation:

Documentation of data processing and Spark
- Python Data Processing
  - Demonstration: IPython notebooks that walk through each process
  - Core Modules
    1. Siri Tools: Modules for Bus Time data retrieval and cleaning
    2. Time Tools: Custom timedelta converter
    3. GTFS: Extract schedule data from GTFS feeds (originally in ZIP format)
    4. Arrival Time: Estimate the arrival time at each stop using a SciPy KD-tree and interpolation
    5. Performance Metrics: Calculate common performance measurements for each route, stop, and date.
- Big Data with Spark
  - For details, check the Spark folder.
Sponsor report
Technical report

Final Product Sample

Darker colors indicate poorer on-time performance for the buses.

Open Interactive Map in Carto Map

ETL procedure

Bus Time data (use siri_tools)
Scrape: Query the Bus Time API every 60 seconds and write each JSON response to a local file. For maximum data density, run two independent scrape processes offset by 30 seconds. Staggering the processes limits the gaps left when a response takes longer than 30 seconds.
Requirements:
* MYKEY file located in the OS working directory containing a single text string. See Bus Time documentation for instructions on getting a key.
* jsons/ directory exists in the OS working directory
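The scrape step can be sketched roughly as follows. This is not the siri_tools implementation: the endpoint URL, the function names, the file naming scheme, and the 30-second timeout are all assumptions made for illustration. The fetch and sleep functions are passed in as parameters so the loop can be exercised without network access.

```python
import os
import time
import urllib.request

# Assumed endpoint: confirm against the Bus Time documentation.
API_URL = "http://bustime.mta.info/api/siri/vehicle-monitoring.json?key={key}"


def read_key(path="MYKEY"):
    """Return the API key stored as a single text string in the MYKEY file."""
    with open(path) as f:
        return f.read().strip()


def fetch(key):
    """Request one snapshot of vehicle positions and return the raw bytes."""
    # The 30-second timeout is an assumption, not a documented setting.
    with urllib.request.urlopen(API_URL.format(key=key), timeout=30) as resp:
        return resp.read()


def scrape(key, out_dir="jsons", interval=60, max_requests=None,
           fetch_fn=fetch, sleep_fn=time.sleep):
    """Save one response per interval; return the list of files written."""
    os.makedirs(out_dir, exist_ok=True)  # creates jsons/ if it is missing
    written = []
    attempts = 0
    while max_requests is None or attempts < max_requests:
        started = time.time()
        attempts += 1
        try:
            body = fetch_fn(key)
        except OSError:
            body = None  # network error or timeout: skip this cycle
        if body is not None:
            # Hypothetical naming scheme: epoch seconds plus attempt counter.
            name = "%d_%06d.json" % (started, attempts)
            path = os.path.join(out_dir, name)
            with open(path, "wb") as f:
                f.write(body)
            written.append(path)
        # Sleep for whatever remains of the interval after the request.
        sleep_fn(max(0.0, interval - (time.time() - started)))
    return written
```

Under these assumptions, calling `scrape(read_key())` in one terminal and again in a second terminal about 30 seconds later would approximate the two staggered processes described above.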
Parse: Extract useful data elements from each vehicle record in each JSON response file. Parsing one JSON file takes roughly one second, so an entire day's worth of data may take up to 15 minutes. The Spark code is significantly faster.
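As an illustration of the parse step, the sketch below flattens each vehicle record into a dictionary. The nested key names follow the SIRI VehicleMonitoring layout as I understand it and should be checked against a real response. The output field names are hypothetical and are not necessarily the ones siri_tools produces.

```python
import json


def parse_vehicles(raw):
    """Return one flat dict per vehicle record in a SIRI JSON response."""
    doc = json.loads(raw)
    deliveries = (doc.get("Siri", {})
                     .get("ServiceDelivery", {})
                     .get("VehicleMonitoringDelivery", []))
    rows = []
    for delivery in deliveries:
        for activity in delivery.get("VehicleActivity", []):
            journey = activity.get("MonitoredVehicleJourney", {})
            location = journey.get("VehicleLocation", {})
            framed = journey.get("FramedVehicleJourneyRef", {})
            call = journey.get("MonitoredCall", {})
            rows.append({
                "recorded_at": activity.get("RecordedAtTime"),
                "vehicle_id": journey.get("VehicleRef"),
                "route": journey.get("LineRef"),
                "trip_id": framed.get("DatedVehicleJourneyRef"),
                "lat": location.get("Latitude"),
                "lon": location.get("Longitude"),
                "next_stop_id": call.get("StopPointRef"),
            })
    return rows
```

Missing keys yield `None` rather than raising, so a malformed record does not abort parsing of the whole file.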
Clean: Using schedule data as the "truth" source, filter extracted and parsed Bus Time data to exclude any records where the reported "next stop" is invalid for the reported trip_id.
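The cleaning rule amounts to a membership test against the schedule. A minimal sketch, assuming the schedule is available as rows with `trip_id` and `stop_id` (the columns of a GTFS stop_times.txt file) and that both sources use the same ID format; the function and field names here are hypothetical:

```python
def valid_pairs(stop_times_rows):
    """Build the set of (trip_id, stop_id) pairs that the schedule allows."""
    return {(row["trip_id"], row["stop_id"]) for row in stop_times_rows}


def clean(records, pairs):
    """Keep only records whose reported next stop belongs to their trip."""
    return [r for r in records
            if (r["trip_id"], r["next_stop_id"]) in pairs]
```

In practice the IDs reported by Bus Time may need to be normalized (for example, by stripping a prefix) before they match the schedule data.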
Schedule data
Download: Static feeds of the current schedule data for each borough (plus the MTA Bus Company) are available directly from the MTA. Historical feeds are available through a third-party open-source project. A shell script to download all previous feeds in one batch can be found in the [Bus Viz GitHub](https://github.com/efranco63/NYU_USI_BusViz/blob/master/TransitFeeds/fetch.sh).
Generate metadata (list of date ranges): Call gtfs.build_metadata(dpath) to generate a small text file within each subdirectory of dpath that lists the valid date ranges of each included feed. Schedule data changes periodically, so any schedule-comparison analysis must use only data extracted from the feed that was in effect on the date being analyzed.
Requirements:
* All downloaded transit feed files must be in their original standard format (zip)
* Each feed gets its own subdirectory, containing current and prior feeds
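To show what a feed's valid date range looks like, the sketch below reads it straight from a feed archive. This is not the code behind gtfs.build_metadata; it assumes the feed contains a standard GTFS calendar.txt with `start_date` and `end_date` columns in YYYYMMDD form, and the function name is hypothetical.

```python
import csv
import io
import zipfile


def feed_date_range(zip_path):
    """Return (earliest start_date, latest end_date), or None if empty."""
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open("calendar.txt") as raw:
            # utf-8-sig tolerates a byte-order mark at the start of the file.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig")
            rows = list(csv.DictReader(text))
    if not rows:
        return None
    # YYYYMMDD strings sort chronologically, so min/max work on text.
    starts = [row["start_date"] for row in rows]
    ends = [row["end_date"] for row in rows]
    return min(starts), max(ends)
```

Feeds that define service only through calendar_dates.txt would need different handling.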
Example directory structure for GTFS data

gtfs/  
  80_brooklyn/  
    metadata.txt  
    gtfs_brooklyn_1383136207.zip  
    gtfs_brooklyn_1419914436.zip  
    gtfs_brooklyn_1386879331.zip  
    gtfs_brooklyn_20150402.zip  
  82_manhattan/  
  84_staten_island/  
  81_bronx/  
  83_queens/  
  85_bus_company/

Stop time estimation
Recommended: Linear interpolation (see demonstration notebook)
Alternative: Spatial search (see demonstration script)
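The idea behind linear interpolation can be sketched in a few lines. This is a simplified illustration, not the code in the demonstration notebook. It assumes each vehicle ping has already been converted to a (time, distance along the route) pair, sorted by time with non-decreasing distance; the function name and units are hypothetical.

```python
from bisect import bisect_left


def interpolate_arrival(pings, stop_dist):
    """Estimate when a vehicle reached stop_dist along its route.

    pings: list of (time_seconds, distance_along_route) tuples, sorted by
    time, with non-decreasing distance. Returns None if the stop lies
    outside the observed range.
    """
    dists = [d for _, d in pings]
    i = bisect_left(dists, stop_dist)
    if i == len(pings):
        return None  # the vehicle was never observed at or past the stop
    if dists[i] == stop_dist:
        return pings[i][0]  # a ping landed exactly on the stop
    if i == 0:
        return None  # the first ping is already past the stop
    (t0, d0), (t1, d1) = pings[i - 1], pings[i]
    # Here d0 < stop_dist < d1, so the denominator is never zero.
    return t0 + (t1 - t0) * (stop_dist - d0) / (d1 - d0)
```

For example, with pings at 300 m (t = 60 s) and 900 m (t = 120 s), a stop at 600 m is halfway between them, so the estimate is t = 90 s.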
Performance metrics
Recommended: Generate a single measurement for each route at the stop with the most data (see demonstration notebook)
Alternative: Batch process metrics for all stops, routes and dates before filtering and analyzing (use metrics.py)
Example: python metrics.py dec2015_interpolated.csv gtfs/ dec2015_metrics.csv
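As one example of a performance measurement, on-time performance can be computed as the share of arrivals that fall within a tolerance window around the schedule. The sketch below is illustrative only: the window of 60 seconds early to 300 seconds late is an assumed placeholder, and neither it nor the function name is taken from metrics.py.

```python
def on_time_share(deviations, early=-60, late=300):
    """Fraction of arrivals whose deviation falls inside [early, late].

    deviations: estimated minus scheduled arrival time, in seconds.
    The default window is an illustrative assumption.
    Returns None when there are no observations.
    """
    if not deviations:
        return None
    hits = sum(1 for d in deviations if early <= d <= late)
    return hits / len(deviations)
```

Applied per route, stop, and date, a value like this could drive a map shading in which lower shares appear darker.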
