sarangof / Bus-Capstone

MTA bus reliability metrics and data quality assessment for SIRI - Bus Time API data.


Bus-Capstone

Capstone project for NYC Department of Transportation.

Important documentation:

Final Product Sample

Darker shading means poorer on-time performance for the buses.


Open Interactive Map in Carto Map

ETL procedure

  1. Bus Time data (use siri_tools)

    * Scrape: Query the Bus Time API every 60 seconds and write each JSON response to a local file. Running two independent scrape processes, offset by 30 seconds, is recommended for maximum data density. This also minimizes the gaps caused by responses that take longer than 30 seconds.
      Requirements:
      * A MYKEY file in the OS working directory containing the API key as a single text string. See the Bus Time documentation for instructions on getting a key.
      * A jsons/ directory in the OS working directory.

    * Parse: Extract the useful data elements from each vehicle record in each JSON response file. Parsing one JSON takes roughly one second, so an entire day's worth of data may take up to 15 minutes. The Spark code is significantly faster.

    * Clean: Using schedule data as the "truth" source, filter the parsed Bus Time data to exclude any records where the reported "next stop" is invalid for the reported trip_id.

  2. Schedule data

    * Download: Static feeds of the current schedule data for each borough (plus the MTA Bus Company) are available directly from the MTA. Historical feeds are available through a third-party open-source project. A shell script that downloads all previous feeds in one batch can be found in the [Bus Viz GitHub](https://github.com/efranco63/NYU_USI_BusViz/blob/master/TransitFeeds/fetch.sh).

    * Generate metadata (list of date ranges): Use the method gtfs.build_metadata(dpath) to generate a small text file within each subdirectory of dpath that lists the valid date ranges of each included feed. This is necessary because schedule data changes periodically, so any schedule-comparison analysis must use only the feed that was in effect on the date being analyzed.
      Requirements:
      * All downloaded transit feed files must be in their original standard format (zip).
      * Each feed gets its own subdirectory, containing current and prior feeds.
    Example directory structure for GTFS data

gtfs/  
  80_brooklyn/  
    metadata.txt  
    gtfs_brooklyn_1383136207.zip  
    gtfs_brooklyn_1419914436.zip  
    gtfs_brooklyn_1386879331.zip  
    gtfs_brooklyn_20150402.zip  
  82_manhattan/  
  84_staten_island/  
  81_bronx/  
  83_queens/  
  85_bus_company/
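
As a rough illustration of how the feed date ranges tie into the Clean step, the sketch below picks the feed whose range covers a service date and then drops Bus Time records whose reported next stop is not on the reported trip. It is a minimal sketch under stated assumptions, not the project's code: the (start, end, filename) tuples are a hypothetical stand-in for whatever gtfs.build_metadata writes, the feed names and dates are invented for the example, and the helper names and the record fields trip_id and next_stop_id are assumptions. The valid (trip_id, stop_id) pairs would come from the stop_times.txt file inside the selected GTFS feed.

```python
import csv
import io
from datetime import date


def select_feed(feeds, service_date):
    """Return the filename of the feed whose date range covers service_date.

    feeds is a list of (start_date, end_date, filename) tuples. This layout is
    hypothetical; it is not the format written by gtfs.build_metadata.
    """
    covering = [f for f in feeds if f[0] <= service_date <= f[1]]
    if not covering:
        raise LookupError(f"no feed covers {service_date}")
    # Assumption: if ranges overlap, prefer the feed that starts latest.
    return max(covering, key=lambda f: f[0])[2]


def load_valid_pairs(stop_times_file):
    """Collect (trip_id, stop_id) pairs from a GTFS stop_times.txt file object."""
    reader = csv.DictReader(stop_times_file)
    return {(row["trip_id"], row["stop_id"]) for row in reader}


def clean(records, valid_pairs):
    """Keep only records whose next stop is scheduled for their trip."""
    return [
        r for r in records
        if (r["trip_id"], r["next_stop_id"]) in valid_pairs
    ]


# Illustrative data only: feed names, dates, trips and stops are made up.
feeds = [
    (date(2015, 1, 4), date(2015, 4, 4), "feed_a.zip"),
    (date(2015, 4, 5), date(2015, 6, 27), "feed_b.zip"),
]
stop_times_txt = io.StringIO(
    "trip_id,arrival_time,departure_time,stop_id,stop_sequence\n"
    "T1,08:00:00,08:00:00,S1,1\n"
    "T1,08:05:00,08:05:00,S2,2\n"
)
records = [
    {"trip_id": "T1", "next_stop_id": "S2"},  # valid for trip T1
    {"trip_id": "T1", "next_stop_id": "S9"},  # S9 is not on trip T1
]

chosen_feed = select_feed(feeds, date(2015, 4, 10))
valid_pairs = load_valid_pairs(stop_times_txt)
cleaned = clean(records, valid_pairs)
print(chosen_feed, cleaned)
```

In practice the stop_times.txt file object would be opened from the selected zip archive rather than built in memory as it is here.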
  3. Stop time estimation

    * Recommended: Linear interpolation (see demonstration notebook)

    * Alternative: Spatial search (see demonstration script)
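
To make the linear interpolation option more concrete, here is a minimal sketch of one way it could work; the demonstration notebook may do it differently. It assumes each vehicle ping has already been reduced to a timestamp in seconds and a distance travelled along the trip's route, and that the stop's distance along the same route is known. The function name and input layout are hypothetical.

```python
def estimate_stop_time(pings, stop_dist):
    """Estimate when a bus passed a stop by interpolating between two pings.

    pings: list of (timestamp_seconds, distance_along_route) tuples sorted by
           time (hypothetical layout).
    stop_dist: distance of the stop along the same route, in the same units.
    Returns the interpolated timestamp, or None if no pair of consecutive
    pings brackets the stop.
    """
    for (t0, d0), (t1, d1) in zip(pings, pings[1:]):
        if d0 <= stop_dist <= d1:
            if d1 == d0:
                # Bus did not move between pings; use the earlier timestamp.
                return float(t0)
            fraction = (stop_dist - d0) / (d1 - d0)
            return t0 + fraction * (t1 - t0)
    return None


# Illustrative pings, 30 seconds apart (made-up values).
pings = [(0, 100.0), (30, 400.0), (60, 700.0)]
print(estimate_stop_time(pings, 550.0))  # halfway between the last two pings
```

The estimate is only as good as the assumption of constant speed between consecutive pings, which is one reason denser scraping helps.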

  4. Performance metrics

    * Recommended: Generate a single measurement for each route at the stop with the most data (see demonstration notebook)

    * Alternative: Batch process metrics for all stops, routes, and dates before filtering and analyzing (use metrics.py)
      Example: python metrics.py dec2015_interpolated.csv gtfs/ dec2015_metrics.csv
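
As one illustration of a per-stop measurement, on-time performance can be expressed as the share of estimated arrivals that fall within a tolerance window around the scheduled time. The sketch below assumes pairs of scheduled and estimated arrival times in seconds. The window of 60 seconds early to 300 seconds late and the function name are assumptions made for the example, not values or names taken from metrics.py.

```python
def on_time_share(arrivals, early_s=60, late_s=300):
    """Return the fraction of arrivals inside the on-time window.

    arrivals: iterable of (scheduled_seconds, estimated_seconds) pairs.
    early_s, late_s: tolerance window in seconds (assumed values).
    Returns None when there are no arrivals to evaluate.
    """
    deviations = [estimated - scheduled for scheduled, estimated in arrivals]
    if not deviations:
        return None
    on_time = sum(1 for d in deviations if -early_s <= d <= late_s)
    return on_time / len(deviations)


# Illustrative arrivals at one stop (made-up values, seconds since midnight).
arrivals = [
    (28800, 28830),  # 30 s late: on time
    (29400, 29800),  # 400 s late: late
    (30000, 29900),  # 100 s early: early
    (30600, 30610),  # 10 s late: on time
]
print(on_time_share(arrivals))
```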



Languages

Jupyter Notebook 92.9%, TeX 5.7%, Python 1.4%, Shell 0.0%