
I94 US Immigration Data Engineering

Udacity Data Engineering Nanodegree Capstone Project

Project Summary

This project creates a data lake for US immigration data from 2016, to be used for data analytics and BI purposes. The raw data is currently stored on S3. A star schema data model is designed, and an ETL pipeline is built to extract the data from S3, clean it up with Spark, and load it back to S3 in Parquet format. The processed data can then be used to build a SQL database.
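
A minimal sketch of the Spark setup such a pipeline relies on, assuming credentials live in the dl.cfg file mentioned later in this write-up (the hadoop-aws version and the config section/key names are assumptions, not taken from the repo):

```python
import configparser
from pyspark.sql import SparkSession

# Load AWS credentials from the project config file (section/key names assumed).
config = configparser.ConfigParser()
config.read("dl.cfg")

spark = (
    SparkSession.builder
    .appName("i94-immigration-data-lake")
    # hadoop-aws enables reading from and writing to S3 via the s3a:// scheme.
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4")
    .config("spark.hadoop.fs.s3a.access.key", config["AWS"]["AWS_ACCESS_KEY_ID"])
    .config("spark.hadoop.fs.s3a.secret.key", config["AWS"]["AWS_SECRET_ACCESS_KEY"])
    .getOrCreate()
)
```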

The project follows these steps:

  • Step 1: Scope the Project and Gather Data
  • Step 2: Explore and Assess the Data
  • Step 3: Define the Data Model
  • Step 4: Run ETL to Model the Data
  • Step 5: Complete Project Write Up

Step 1: Project Scope and Data Gathering

Scope

This project creates a data lake: immigration, airport, and demographics data collected from various sources is cleaned up and loaded back to S3 as a star schema. Data analysts can use this database to explore the relationships among US cities, ports of entry, and immigrants from different countries.

The main tools used in this project are Python, Pandas, Pyspark, SQL, S3, Airflow and Redshift.

Data Intro

  • I94 Immigration Data: This data comes from the US National Tourism and Trade Office.

    • Description:
      A detailed description can be found in I94_SAS_Labels_Descriptions.SAS, which documents the I94 metadata. The data consists of 12 files, one for each month of 2016; each file is roughly 500 MB and contains about 3 million rows.

    • Sample:

cicid i94yr i94mon i94cit i94res i94port arrdate i94mode i94addr depdate i94bir i94visa count dtadfile visapost occup entdepa entdepd entdepu matflag biryear dtaddto gender insnum airline admnum fltno visatype
1397736 2.86358e+06 2016 4 689 689 MIA 20559 1 FL 20569 10 2 1 20160415 SPL nan G O nan M 2006 10142016 F nan JJ 9.36397e+10 08094 B2
997880 2.03818e+06 2016 4 245 245 SFR 20555 1 CA 20570 50 2 1 20160411 SHG nan G O nan M 1966 10102016 F nan MU 9.32671e+10 00589 B2
  • U.S. City Demographic Data
    • Description: This dataset contains information about the demographics of all US cities and census-designated places with a population greater than or equal to 65,000. This data comes from the US Census Bureau's 2015 American Community Survey.
    • Sample:
City State Median Age Male Population Female Population Total Population Number of Veterans Foreign-born Average Household Size State Code Race Count
629 Arden-Arcade California 41.5 47596 48680 96276 6511 13458 2.18 CA White 69369
1438 Edmond Oklahoma 32.5 45191 44899 90090 5006 5585
  • Airport Code Table

    • Description: This is a simple table of airport codes and corresponding cities.
    • Sample
ident type name elevation_ft continent iso_country iso_region municipality gps_code iata_code local_code coordinates
1613 16Z seaplane_base Mc Grath Seaplane Base 325 nan US US-AK Mcgrath 16Z nan 16Z -155.593002319, 62.9580001831
10090 9NC7 small_airport Willow Creek Airport 572 nan US US-NC Mt Pleasant 9NC7 nan 9NC7 -80.44000244140625, 35.36970138549805
  • ISO Country Code

    • Description: ISO 3166-1-alpha-2 English country names and code elements. This list states the country names (official short names in English) in alphabetical order as given in ISO 3166-1 and the corresponding ISO 3166-1-alpha-2 code elements. [ISO 3166-1].
    • Sample
Name Code
0 Afghanistan AF
1 Åland Islands AX

Step 2: Explore and Assess the Data

2.1 Explore the Data

Identify data quality issues, like missing values, duplicate data, etc.
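
A minimal sketch of this exploration step, assuming the raw tables have already been loaded into Spark DataFrames (the helper name and the example call are illustrative):

```python
from pyspark.sql import functions as F

def profile(df, name):
    """Print basic quality signals: row count, exact duplicates, and nulls per column."""
    total = df.count()
    duplicates = total - df.dropDuplicates().count()
    print(f"{name}: {total} rows, {duplicates} exact duplicates")
    # Count missing values column by column.
    null_counts = df.select(
        [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
    ).collect()[0].asDict()
    for col, n in null_counts.items():
        if n > 0:
            print(f"  {col}: {n} nulls ({n / total:.1%})")

# e.g. profile(immigration_df, "i94 immigration")
```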

2.2 Data Cleaning

Clean and transform the data to make sure it is ready for downstream extraction and saving.
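
A minimal cleaning sketch for the immigration table, assuming immigration_df is the raw I94 DataFrame and that arrdate/depdate are SAS numeric dates counted in days since 1960-01-01 (the standard SAS date encoding); the chosen key columns are illustrative:

```python
from pyspark.sql import functions as F

def sas_to_date(col_name):
    # SAS stores dates as the number of days since 1960-01-01.
    return F.expr(f"date_add(to_date('1960-01-01'), cast({col_name} as int))")

immigration_clean = (
    immigration_df
    .dropDuplicates(["cicid"])                       # cicid identifies a record
    .dropna(subset=["cicid", "i94port", "arrdate"])  # drop rows missing key fields
    .withColumn("arrival_date", sas_to_date("arrdate"))
    .withColumn("departure_date", sas_to_date("depdate"))
)
```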

Step 3: Define the Data Model

3.1 Conceptual Data Model

The main focus of this project is the immigration and demographics information, so a star schema data model is a natural fit: the immigration and demographics tables serve as the fact tables, supported by dimension tables such as time, city, country, and airport.
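
For illustration, a sketch of how the immigration fact table and a time dimension could be derived from the cleaned data (the column choices are examples, not the project's final schema):

```python
from pyspark.sql import functions as F

# Time dimension built from the converted arrival dates.
dim_time = (
    immigration_clean
    .select(F.col("arrival_date").alias("date"))
    .dropDuplicates()
    .withColumn("year", F.year("date"))
    .withColumn("month", F.month("date"))
    .withColumn("day", F.dayofmonth("date"))
    .withColumn("weekday", F.dayofweek("date"))
)

# Immigration fact table keeps the measures and the keys into the dimensions.
fact_immigration = immigration_clean.select(
    "cicid", "arrival_date", "departure_date",
    "i94port", "i94addr", "i94cit", "i94visa", "visatype", "gender", "biryear",
)
```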

3.2 Mapping Out Data Pipelines

  • Read in configuration settings (dl.cfg);
  • Use Spark DataFrames to read the raw data from S3, clean and transform it, select the corresponding columns for each fact and dimension table, and save the tables back to S3 for future use (see the sketch after this list).
  • Finally, run data quality checks on each table to validate the output (key columns contain no nulls, each table's record count is non-zero, etc.).
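
The save step could look roughly like this (the output bucket path and partition columns are placeholders):

```python
# Write each table back to S3 in Parquet format.
OUTPUT = "s3a://my-output-bucket/i94-data-lake"

fact_immigration.write.mode("overwrite") \
    .partitionBy("i94port") \
    .parquet(f"{OUTPUT}/fact_immigration")

dim_time.write.mode("overwrite") \
    .partitionBy("year", "month") \
    .parquet(f"{OUTPUT}/dim_time")
```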

Step 4: Run Pipelines to Model the Data

4.1 Create the data model

Build the data pipelines to create the data model. (etl.py)

4.2 Data Quality Checks

Data quality checks ensure the pipeline ran as expected. Many of the quality checks were performed earlier, when replacing missing values, removing duplicates, etc. Another important check is to make sure each table's primary key is non-null and unique before the tables are written back to S3.
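
A sketch of such a check, assuming the tables exist as Spark DataFrames (the function name and the example call are illustrative):

```python
from pyspark.sql import functions as F

def check_table(df, name, key):
    """Fail loudly if a table is empty or its primary key has nulls or duplicates."""
    total = df.count()
    if total == 0:
        raise ValueError(f"{name}: table has zero records")
    if df.filter(F.col(key).isNull()).count() > 0:
        raise ValueError(f"{name}: primary key '{key}' contains nulls")
    if df.select(key).distinct().count() != total:
        raise ValueError(f"{name}: primary key '{key}' is not unique")
    print(f"{name}: passed quality checks ({total} records)")

# e.g. check_table(fact_immigration, "fact_immigration", "cicid")
```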

4.3 Data dictionary

Create a data dictionary for the data model. For each field, a brief description of what the data is and where it came from is provided. (capstone.ipynb)

Step 5: Complete Project Write Up

5.1 Technology Choice

This project is designed to build a data lake for analytics and BI teams to research immigration patterns, countries of origin, airlines, and their relation to the demographics of different states. Creating the data lake on S3 is an appropriate choice: it offers flexibility, efficiency, and cost effectiveness. An S3 data lake gives data scientists a convenient way to use the data directly, and it can serve as a staging area for data engineers to build dimensional tables in Redshift or RDS.

Python, the Spark framework, and PySpark are appropriate choices for the implementation because of their library support. The immigration data can contain millions of records every month, and handling that volume calls for a distributed system like Spark. Moreover, as the data grows each month, the need for a distributed system only becomes clearer.

The ETL pipeline is scripted in etl.py. It reads the immigration SAS files from S3 using a Spark session, cleans and transforms the data, joins it with several other data sources, and writes the results back to S3 in Parquet format.
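
One common way to read SAS files into Spark is the spark-sas7bdat connector; a sketch, assuming that package has been added to spark.jars.packages alongside hadoop-aws and reusing the Spark session from the earlier sketch (the S3 path and file name are illustrative):

```python
# Requires e.g. saurfang:spark-sas7bdat on the Spark classpath
# (the exact version depends on the Spark/Scala build).
immigration_raw = (
    spark.read
    .format("com.github.saurfang.sas.spark")
    .load("s3a://my-raw-bucket/i94/i94_apr16_sub.sas7bdat")
)
immigration_raw.printSchema()
```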

5.2 Data Update

The I94 immigration data could be updated daily, weekly, or monthly, depending on the analytics or BI purpose. For research analysis, monthly or yearly updates would probably be enough. For BI monitoring, it might need to be updated daily.

Other data (the airport and demographics tables) is relatively static and can be updated yearly or even once every several years.

5.3 Different Scenarios Discussion:

  • The data was increased by 100x.

Right now there are around 6 GB of data, so a 100x increase would mean roughly 600 GB. That could still be handled by a distributed Spark cluster on EMR, and if speed is a concern, the number of nodes or clusters can easily be increased on EMR.

  • The data populates a dashboard that must be updated on a daily basis by 7am every day.

Apache Airflow is a suitable solution for scheduled script runs.
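
A minimal Airflow DAG sketch for such a schedule (the DAG id, start date, schedule, and spark-submit command are illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Run early enough each day that the dashboard is refreshed before 7am.
with DAG(
    dag_id="i94_daily_refresh",
    start_date=datetime(2016, 1, 1),
    schedule_interval="0 5 * * *",  # 5am daily, leaving headroom before the 7am deadline
    catchup=False,
) as dag:
    run_etl = BashOperator(
        task_id="run_etl",
        bash_command="spark-submit etl.py",
    )
```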

  • The database needed to be accessed by 100+ people.

Building the database in Redshift would allow hundreds of people to access it at the same time. Redshift is reported to handle up to 500 concurrent connections per cluster, with roughly 15 queries running concurrently within a cluster. If more concurrent queries are needed, Redshift's Concurrency Scaling feature can support virtually unlimited concurrent users and concurrent queries with consistently fast query performance.
