VatsalP / Data-Pipeline-NY-311-Service-Requests

Data Pipeline in Google Cloud Platform for New York City 311 Service Requests dataset

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Data Pipeline to process New York City 311 Service Requests

Dataset:

NY 311 Service Requests is hosted by the City of New York. NYC311’s mission is to provide the public with quick, easy access to all New York City government services and information while offering the best customer service.

The csv dataset is over 10 GB.

Pipeline

The Pipeline uses Google Cloud Platform services to process the dataset in efficient and scalable way!

Data Pipeline

The pipeline consists of:

  • Two Cloud Storage buckets
    • Processing Bucket which will trigger the start of the pipeline
    • Processed Bucket which will have two folders, one for failure and one for success
  • Two Cloud Functions
  • Cloud Composer environement that is the core of the pipeline and which starts Cloud Dataflow job for processing the dataset
  • Cloud Pub/Sub topic for email notification, which will trigger the second cloud function
  • BigQuery used as Data Warehouse, dataflow job writes the processed data to BigQuery table

Airflow variables to be set:

gcs_completion_bucket -> Processed bucket name

gcp_temp_location -> For data flow, set some bucket location

gcp_project -> Project name

pubsub_topic_email -> Pub/Sub Topic name that will trigger second cloud function

bq_output_table -> BigQuery table name


input_field_names -> unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,street_name,cross_street_1,cross_street_2,intersection_street_1,intersection_street_2,address_type,city,landmark,facility_type,status,due_date,resolution_description,resolution_action_updated_date,community_board,bbl,borough,x_coordinate_state_plane,y_coordinate_state_plane,open_data_channel_type,park_facility_name,park_borough,vehicle_type,taxi_company_borough,taxi_pick_up_location,bridge_highway_name,bridge_highway_direction,road_ramp,bridge_highway_segment,latitude,longitude,location

Triggering the pipeline

Just drop csv dataset in the Processing bucket and we are good to go! You can use the resulting BigQuery table to gain valuable insights from the 311 service requests about New York City like what is the Noisest Residential street in New York City. (it is EAST 231 STREET according to the complaints recieved)

You can also connect BigQuery as source on DataStudio and make an interactive dashboard. PDF of the one I made.

One of the page from the dashboard, look at the spike towards the end - likely due to COVID and more people staying at home. data_studio

About

Data Pipeline in Google Cloud Platform for New York City 311 Service Requests dataset


Languages

Language:Python 100.0%