
Data Engineering ZoomCamp Course Project - US Accidents

Preface

This repository contains the course project for the Data Engineering Zoomcamp (Cohort 2023) organized by the DataTalks.Club community. The project covers the main data engineering skills taught in the course:

  • Workflow Orchestration: Data Lake, Prefect tool, ETL with GCP & Prefect
  • Data Warehouse: BigQuery
  • Analytics engineering: dbt (data build tool)
  • Data Analysis: Looker Studio

US Accidents Project

Dataset

US car crash dataset covering 49 states. Crash data was collected from February 2016 to December 2021 using several APIs that provide streaming traffic incident (or event) data. These APIs broadcast traffic data captured by a variety of entities, such as US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within road networks. The dataset currently contains around 2.8 million crash records.

The dataset has 47 columns, but for this project I selected only the columns relevant to my analysis. The following columns will be used:

| # | Attribute | Description |
|---|-----------|-------------|
| 1 | ID | Unique identifier of the accident record. |
| 2 | Severity | Severity of the accident, a number between 1 and 4, where 1 indicates the least impact on traffic (i.e., a short delay as a result of the accident) and 4 indicates a significant impact on traffic (i.e., a long delay). |
| 3 | Start_Time | Start time of the accident in the local time zone. |
| 4 | End_Time | End time of the accident in the local time zone. End time refers to when the impact of the accident on traffic flow was dismissed. |
| 5 | Description | Natural language description of the accident. |
| 6 | Street | Street name in the address field. |
| 7 | City | City in the address field. |
| 8 | State | State in the address field. |
| 9 | Country | Country in the address field. |
| 10 | Weather_Condition | Weather condition (rain, snow, thunderstorm, fog, etc.). |
| 11 | Sunrise_Sunset | Period of day (i.e., day or night) based on sunrise/sunset. |
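
To make the selection concrete, here is a minimal sketch of loading the raw CSV with Pandas and keeping only these columns (the file name US_Accidents_Dec21_updated.csv is an assumption; use the name of your Kaggle download):

```python
import pandas as pd

# Columns kept for the analysis (see table above)
COLUMNS = [
    "ID", "Severity", "Start_Time", "End_Time", "Description",
    "Street", "City", "State", "Country",
    "Weather_Condition", "Sunrise_Sunset",
]

# File name is an assumption; adjust to your local Kaggle download.
df = pd.read_csv(
    "US_Accidents_Dec21_updated.csv",
    usecols=COLUMNS,
    parse_dates=["Start_Time", "End_Time"],
)
print(df.shape)  # expect roughly 2.8 million rows and 11 columns
```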

More information about this dataset: Author blog and Kaggle

Dataset Acknowledgments

  • Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. "A Countrywide Traffic Accident Dataset." 2019.
  • Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. "Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights." In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.

Architecture Diagram

Technologies Used

  • Google Cloud Platform (GCP):
    • Google Cloud Storage (GCS): Data Lake
    • BigQuery: Data Warehouse
  • Terraform: Infrastructure as code (IaC)
  • dbt: Data Transformation
  • Pandas: Data Analysis & Exploration
  • Prefect: Workflow Orchestration
  • Looker Studio: Visualize Data

DW Table Structure

| # | Attribute | Description |
|---|-----------|-------------|
| 1 | accident_id | Unique identifier of the accident record. |
| 2 | severity_id | Severity of the accident, a number between 1 and 4, where 1 indicates the least impact on traffic (i.e., a short delay as a result of the accident) and 4 indicates a significant impact on traffic (i.e., a long delay). |
| 3 | start_date | Date on which the accident started. |
| 4 | end_date | Date on which the accident ended. |
| 5 | start_time | Start time of the accident in the local time zone. |
| 6 | end_time | End time of the accident in the local time zone. End time refers to when the impact of the accident on traffic flow was dismissed. |
| 7 | description | Natural language description of the accident. |
| 8 | street | Street name in the address field. |
| 9 | city | City in the address field. |
| 10 | state | State in the address field. |
| 11 | country | Country in the address field. |
| 12 | weather_condition | Weather condition (rain, snow, thunderstorm, fog, etc.). |
| 13 | sunrise_sunset | Period of day (i.e., day or night) based on sunrise/sunset. |


Partitioning and Clustering:

  • Partitioned by the start_date column, by year, so that each partition holds one year of accidents (annual granularity)
  • Clustered by the country column, grouping rows that share the same country value
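
As an illustration of this setup, here is a sketch of creating such a table with the google-cloud-bigquery Python client (project, dataset, and table names follow the ones used later in this README; the schema is abbreviated):

```python
from google.cloud import bigquery

client = bigquery.Client(project="dezoomcamp-finalproject")

table = bigquery.Table(
    "dezoomcamp-finalproject.production.dim_us_traffic_accidents",
    schema=[
        bigquery.SchemaField("accident_id", "STRING"),
        bigquery.SchemaField("severity_id", "INT64"),
        bigquery.SchemaField("start_date", "DATE"),
        bigquery.SchemaField("country", "STRING"),
        # ... remaining columns from the DW table above
    ],
)
# Yearly partitions on start_date, clustered by country
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.YEAR,
    field="start_date",
)
table.clustering_fields = ["country"]
client.create_table(table)
```

In practice the table is materialized by dbt (Step 6); this sketch only shows what the partitioning and clustering configuration amounts to.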

For the benefits of this combination, see the BigQuery documentation: Combining clustered and partitioned tables

Data visualization: Dashboards

Main Questions

  1. Which state/city/street in the US reported the most accident cases between 2016 and 2021?
  2. What were the weather conditions in most of the accident cases in the US?
  3. Did most accidents occur at night or during the day?
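
As an illustration, question 1 can be answered directly against the warehouse; here is a sketch using the BigQuery Python client (the table name matches Step 6 below; the dashboards themselves are built in Looker Studio):

```python
from google.cloud import bigquery

client = bigquery.Client(project="dezoomcamp-finalproject")

# Accident counts per state; swap `state` for `city` or `street`
# to answer the other variants of question 1.
query = """
    SELECT state, COUNT(*) AS accident_cases
    FROM `dezoomcamp-finalproject.production.dim_us_traffic_accidents`
    GROUP BY state
    ORDER BY accident_cases DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.state, row.accident_cases)
```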

US Crash Accidents by State, City and Street - Dashboard


US Crash Accidents by Severity, Weather Conditions, Day/Night and Date (Year and Month)


More detailed analysis of the results obtained: Data Analysis

How to reproduce this project?

Step 1: Clone this repo and install necessary requirements

  1. Clone the repo into your local machine:
git clone git@github.com:tmaferreira/DataEngineeringZoomCampProject.git
  2. Install all required dependencies into your environment:
pip3 install -r requirements.txt

Step 2: Setup of GCP

  1. Create a Google Cloud Platform (GCP) free account with your Google e-mail

  2. Create a new GCP project with the name dezoomcamp-finalproject (Note: save the assigned project ID; project IDs are unique, so a different ID may be assigned to your project)

  3. Create a Service Account:

    • Go to IAM & Admin > Service accounts > Create service account
    • Provide a service account name and grant the roles: Viewer + BigQuery Admin + Storage Admin + Storage Object Admin
    • Download the Service Account json file
    • Download SDK for local setup
    • Set the environment variable to point to your downloaded GCP keys (a quick verification sketch follows this list):
    export GOOGLE_APPLICATION_CREDENTIALS="<path/to/your/service-account-authkeys>.json"
    # Refresh token/session, and verify authentication
    gcloud auth application-default login
  4. Enable the following APIs:
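
Before moving on, here is a minimal sanity check that the credentials from step 3 work (assumes the google-cloud-storage package is installed in your environment):

```python
from google.cloud import storage

# Picks up GOOGLE_APPLICATION_CREDENTIALS set above.
client = storage.Client(project="dezoomcamp-finalproject")

# Listing buckets is a cheap way to confirm the service account
# and its Storage roles are wired up correctly.
for bucket in client.list_buckets():
    print(bucket.name)
```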

Step 3: Creation of a GCP Infrastructure

  1. Install Terraform
  2. Copy files (main.tf and variables.tf) for the infrastructure creation (Use files created in Zoomcamp course: Terraform files)
  3. In the file variables.tf, change the variable BQ_DATASET to: us_traffic_accidents_data
  4. Execute the following commands to plan the creation of the GCP infrastructure:
# Initialize state file (.tfstate)
terraform init

# Check changes to new infra plan
# -var="project=<your-gcp-project-id>"

terraform plan -var="project=dezoomcamp-finalproject"
# Create new infra
# -var="project=<your-gcp-project-id>"

terraform apply -var="project=dezoomcamp-finalproject"

You can verify in the GCP console that the infrastructure was created correctly.

Step 4: Setup of Kaggle API

  1. Create a Kaggle free account

  2. Create an API token:

    • Click on your avatar
    • Go to Account menu
    • Click on the option "Create New API Token"
    • Download the json file for local setup
  3. In your local setup, copy the file into the path:

~/.kaggle/
  4. For your security, ensure that other users of your computer do not have read access to your credentials:
chmod 600 ~/.kaggle/kaggle.json

To see all available API options and commands:

 kaggle --help
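
With the token in place, the dataset can also be downloaded programmatically. Here is a sketch using the kaggle Python package (the dataset slug is assumed; verify it against the Kaggle page linked in the Dataset section):

```python
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads ~/.kaggle/kaggle.json

# "sobhanmoosavi/us-accidents" is assumed to be the dataset slug;
# check it against the Kaggle page before running.
api.dataset_download_files(
    "sobhanmoosavi/us-accidents",
    path="data",
    unzip=True,
)
```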

Step 5: Setup orchestration using Prefect

  1. Set up the Prefect server so that you can access the UI. Run the following command in a CL terminal:
 prefect orion start
  2. Access the UI in your browser: http://127.0.0.1:4200/
  3. To connect to GCS buckets, it is necessary to create a block:
  • In the side menu click on the option Blocks

  • Click on the '+' button and select the GCS Bucket option

  • Fill in the required fields

  • In the Gcp Credentials field click on the Add button

  • Fill in the Block Name field

  • Using the service account json file that was downloaded in step 2, copy its content and paste it in the Service Account Info field

  • Click on the Create button and you will be redirected to the previous GCS Bucket block creation page:

  • In the Gcp Credentials field, select the Gcp credential created previously

  • Click on the Create button to create the block

  4. To execute the flow, run the following command in a CL terminal different from the one used in step 1 (a sketch of the flow's overall shape follows the command):
python prefect/flows/api_to_gcs_to_bq.py
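
The flow's overall shape is roughly the following sketch (task and block names are assumptions, not the exact contents of prefect/flows/api_to_gcs_to_bq.py):

```python
import pandas as pd
from prefect import flow, task
from prefect_gcp.cloud_storage import GcsBucket

@task
def extract() -> pd.DataFrame:
    # Load the Kaggle download (see Step 4); the path is an assumption.
    return pd.read_csv("data/US_Accidents_Dec21_updated.csv")

@task
def load_to_gcs(df: pd.DataFrame) -> None:
    # "zoomcamp-gcs" is an assumed block name; use the block created above.
    df.to_parquet("us_accidents.parquet")
    gcs_block = GcsBucket.load("zoomcamp-gcs")
    gcs_block.upload_from_path("us_accidents.parquet", "us_accidents.parquet")

@flow
def api_to_gcs_to_bq():
    df = extract()
    load_to_gcs(df)

if __name__ == "__main__":
    api_to_gcs_to_bq()
```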

Step 6: Running the dbt flow

  1. Create a free dbt Cloud account
  2. Clone this repo
  3. In the dbt command line, run the following command:
dbt run

dbt lineage generated:

Validation of created tables

Production Table

Check Data in BigQuery:

  • The data will be available at dezoomcamp-finalproject.dbt_us_traffic_accidents
  • The production version will be available at dezoomcamp-finalproject.production.dim_us_traffic_accidents (dimension table) and dezoomcamp-finalproject.production.stg_us_traffic_accidents (staging table)
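
A quick way to validate from Python that dbt materialized the expected tables (project and dataset names as above):

```python
from google.cloud import bigquery

client = bigquery.Client(project="dezoomcamp-finalproject")

# Expect dim_us_traffic_accidents and stg_us_traffic_accidents.
for table in client.list_tables("production"):
    print(table.table_id)
```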

Improvements

  • Add unit tests
  • Add CI/CD pipeline
  • Containerize the project
  • Perform deeper data analysis
