Bus-Data-NYC / mta-bus-archive

Scrape and archive realtime bus position data

mta bus archive

Download archived NYC MTA bus position data, and scrape gtfs-realtime data from the MTA.

Bus position data for July 2017 forward is archived at https://s3.amazonaws.com/nycbuspositions. Archive files follow the pattern https://s3.amazonaws.com/nycbuspositions/YYYY/MM/YYYY-MM-DD-bus-positions.csv.xz, e.g. https://s3.amazonaws.com/nycbuspositions/2017/07/2017-07-14-bus-positions.csv.xz.
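
For example, to fetch and decompress a single day with standard tools (independent of this repository's tooling):

# download one day of archived positions, then decompress to a CSV
curl -O https://s3.amazonaws.com/nycbuspositions/2017/07/2017-07-14-bus-positions.csv.xz
xz -d 2017-07-14-bus-positions.csv.xz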

Requirements:

  • Python 3.4+
  • PostgreSQL 9.5+

Set up

Specify your connection parameters using the standard Postgres environment variables:

PGDATABASE=dbname
PGUSER=myuser
PGHOST=myhost.com

You may skip this step if you're using a socket connection to your user's database.
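
To confirm that these variables resolve to a working connection, a quick check (not part of this repository's tooling) is:

psql -c "SELECT version();"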

Initialization

This command will create a number of tables whose names begin with rt_, notably rt_vehicle_positions, rt_alerts, and rt_trip_updates. It will also install the Python requirements, including the Google Protobuf library.

make install
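
Once make install finishes, you can confirm that the tables exist, for example:

psql -c "\dt rt_*"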

Download an MTA Bus Time archive file

Download a (UTC) day from data.mytransit.nyc and import it into the Postgres database dbname:

make -f download.mk download DATE=2016-12-31
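
To backfill several days, you could loop over dates in the shell (a sketch; the dates are illustrative):

for d in 2016-12-29 2016-12-30 2016-12-31; do
    make -f download.mk download DATE=$d
done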

Scraping

Scrapers have been tested with Python 3.4 and above. Earlier versions of Python (e.g. 2.7) won't work.

Scrape

The scraper assumes that an environment variable, BUSTIME_API_KEY, contains an MTA BusTime API key. Get a key from the MTA.

export BUSTIME_API_KEY=xyz123
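
To sanity-check the key outside the scraper, you can query the BusTime SIRI vehicle-monitoring endpoint directly (the exact URL and parameters here are an assumption based on the public BusTime API, not part of this repository):

curl "http://bustime.mta.info/api/siri/vehicle-monitoring.json?key=$BUSTIME_API_KEY" | head -c 500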

Download the current positions from the MTA API and save them in a local PostgreSQL database named mtadb:

make positions

Download current trip updates:

make tripupdates

Download current alerts:

make alerts
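
To verify that a scrape landed in the database, count the rows in the corresponding rt_ table (table names from the initialization step above):

psql -c "SELECT count(*) FROM rt_vehicle_positions;"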

Scheduling

The included crontab shows an example setup for downloading data from the MTA API. It assumes that this repository is saved in ~/mta-bus-archive. Fill in the PG_DATABASE and BUSTIME_API_KEY variables before using it.
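
A hypothetical entry in that spirit, scraping positions every minute (the schedule, path, and log location are illustrative; see the repository's crontab for the real setup):

* * * * * cd ~/mta-bus-archive && make positions >> /tmp/mta-scrape.log 2>&1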

Uploading files to Google Cloud

Setup

Create a project in the Google API Console. Make sure to enable the "Google Cloud Storage API" for your application. Then set up a service account. This will download a file containing credentials named something like myprojectname-3e1f812da9ac.json.

Then run the following (on the machine you'll be using to scrape and upload) and follow instructions:

gsutil config -e

Next, create a bucket for the data using the Google Cloud Console.
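
Alternatively, the bucket can be created from the command line (the bucket name here is illustrative):

gsutil mb gs://mydbname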

You've now authenticated yourself to the Google API and will be able to run a command like:

make -e gcloud DATE=2017-07-14 PG_DATABASE=mydbname

By default, the Google Cloud bucket will have the same name as the database. Use the variable GOOGLE_BUCKET to customize it.
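
For example, to upload the same day to a differently named bucket:

make -e gcloud DATE=2017-07-14 PG_DATABASE=mydbname GOOGLE_BUCKET=my-custom-bucket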

License

Available under the Apache License.
