joemarlo / NYC-data

An NYC transportation database for analyses

NYC-data

A database of NYC transportation data, built to ease and expedite future analyses. Currently, the database includes Citi Bike and subway data and is approximately 20 GB, representing about 140 million rows.

Create_database.R creates the SQLite database from the Citi Bike, subway, and (eventually) taxi trip data. The shell scripts in each folder must be run first to download the raw data. Individual scripts that clean and analyze data from the database are in the folders Citi-bike, Subway-turnstiles, and Taxi.
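
As a rough illustration of the kind of step a database-building script performs, the sketch below appends data frames to a SQLite table with DBI. It is not taken from Create_database.R: the table name citibike.demo, the column names, the values, and the in-memory database are all assumptions made for illustration.

```r
library(DBI)

# hypothetical stand-in for NYC.db: an in-memory SQLite database
conn <- dbConnect(RSQLite::SQLite(), ":memory:")

# hypothetical batches of cleaned data (e.g. one per downloaded file)
batch.1 <- data.frame(Station = c("A", "B"), Trips = c(10L, 20L))
batch.2 <- data.frame(Station = c("A", "C"), Trips = c(5L, 7L))

# the first write creates the table; later writes append rows to it
dbWriteTable(conn, "citibike.demo", batch.1)
dbWriteTable(conn, "citibike.demo", batch.2, append = TRUE)

# count the rows now stored on disk (the dotted table name must be quoted in SQL)
n.rows <- dbGetQuery(conn, 'SELECT COUNT(*) AS n FROM "citibike.demo"')$n

dbDisconnect(conn)
```

Writing in batches like this keeps memory use bounded, since only one batch needs to be held in R at a time.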

Once the database is created, data can easily be accessed via SQL and dbplyr queries:

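# load the required packages (dbplyr must also be installed for tbl() to work on a database)
library(DBI)
library(dplyr)

# establish the connection to the database
conn <- dbConnect(RSQLite::SQLite(), "NYC.db")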

# query and mutate on-disk
turnstile.df <- tbl(conn, "turnstile.2019")
turnstile.df %>%
  select(Station, Time, Entries, Exits) %>%
  group_by(Station) %>%
  summarize(Entries = sum(Entries),
            Exits = sum(Exits))

# or pull the full table into memory and then treat it as a standard data frame
turnstile.df <- tbl(conn, "turnstile.2019") %>% collect()
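
The same aggregation can also be written as plain SQL through DBI. The sketch below is self-contained for illustration only: it builds a small in-memory table with made-up values in place of NYC.db, and the column names simply mirror the dbplyr example above. Because the table name contains a dot, it is quoted inside the SQL string.

```r
library(DBI)

# hypothetical stand-in for NYC.db
conn <- dbConnect(RSQLite::SQLite(), ":memory:")

# made-up turnstile counts, not real data
toy <- data.frame(
  Station = c("Astor Pl", "Astor Pl", "Bedford Av"),
  Entries = c(100L, 150L, 300L),
  Exits   = c(80L, 120L, 250L)
)
dbWriteTable(conn, "turnstile.2019", toy)

# total entries and exits per station, computed by SQLite
station.totals <- dbGetQuery(conn, '
  SELECT Station,
         SUM(Entries) AS Entries,
         SUM(Exits)   AS Exits
  FROM "turnstile.2019"
  GROUP BY Station
  ORDER BY Station
')

dbDisconnect(conn)
```

dbGetQuery() returns an ordinary data frame, so only the aggregated result is loaded into memory.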

Analyses utilizing the database

To-do list

  • Build database core
  • Add in Citi Bike data to database
  • Add in Subway data to database
  • Add in taxi data to database
  • Add lat/long information for each subway station
  • Ensure date-time formats are consistent across tables (note: use as_date() or as_datetime() to convert the values returned by queries)
  • Create example of modeling on-disk
  • Create example visualizations
  • Add Central Park weather to database
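
The date-time note in the list above can be sketched as follows, assuming as_date() and as_datetime() refer to the lubridate functions of those names and that dates are stored as numbers, which is RSQLite's default when writing Date and POSIXct columns. The table name demo and its values are hypothetical.

```r
library(DBI)
library(lubridate)

# hypothetical stand-in for NYC.db
conn <- dbConnect(RSQLite::SQLite(), ":memory:")

toy <- data.frame(
  Date = as.Date("2020-01-01"),
  Time = as.POSIXct("2020-01-01 08:30:00", tz = "UTC")
)
dbWriteTable(conn, "demo", toy)

# SQLite has no native date type, so both columns come back as plain numbers:
# days since 1970-01-01 for Date, seconds since 1970-01-01 UTC for Time
raw <- dbGetQuery(conn, "SELECT * FROM demo")

# convert the numeric values back into date and date-time classes
converted <- data.frame(
  Date = as_date(raw$Date),
  Time = as_datetime(raw$Time)
)

dbDisconnect(conn)
```

With dbplyr, the same conversion would typically be applied after collect(), once the values are in memory.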

Visualizations created from the database

Languages

R 74.5%, JavaScript 25.4%, Shell 0.2%