Project: United States Immigration

The project follows the follow steps:

Step 1: Scope the Project and Gather Data
Step 2: Explore and Assess the Data
Step 3: Define the Data Model
Step 4: Run ETL to Model the Data
Step 5: Complete Project Write Up

Datasets

I94 Immigration Data:

This data comes from the US National Tourism and Trade Office.
A data dictionary is included in the workspace.
There's a sample file so you can take a look at the data in csv format before reading it all in.
You do not have to use the entire dataset, just use what you need to accomplish the goal you set at the beginning of the project.
https://i94.cbp.dhs.gov/I94/#/home
The immigration data and the global temperate data is in an attached disk.
You can access the immigration data in a folder with the following path: ../../data/18-83510-I94-Data-2016/.
There's a file for each month of the year. An example file name is i94_apr16_sub.sas7bdat.

What is a Form I-94?

Form I-94 is the DHS Arrival/Departure Record issued to aliens who are admitted to the U.S.,
who are adjusting status while in the U.S. or extending their stay, among other things.
A CBP officer generally attaches the I-94 to the non-immigrant visitor's passport upon U.S. entry.

U.S. City Demographic Data:

This data comes from OpenSoft.
This dataset contains information about the demographics of all US cities and census-designated places with a population greater or equal to 65,000.
This data comes from the US Census Bureau's 2015 American Community Survey.

Schema Used For Analysis

Using the immigration and demographics datasets, we will create a star schema optimized for queries on immigration analysis. This includes the following tables.

Fact Table

Since we're interested in the flow of travellers through the united states. The i94 data will serve as our fact table.

1. immigration

-	Col	Description
1	cicid	Application number / Citizenship and Immigration C...
2	arrival_year	Arrival Year
3	arrival_month	Arrival Month
4	citizinship	Country Immigrant is Originally From (country of citizernship)
5	residence	Country of Immigrant Residence
6	port	AIR / SEAPORT of entry into the US ('XXX': 'NOT REPORTED/UNKNOWN' - '888': 'UNIDENTIFED AIR / SEAPORT' -'UNK': 'UNKNOWN POE')
7	port_city	port city
8	port_state	port state
9	arrival_date	Arrival Date to USA
10	travel_mode	(1: 'Air' - 2: 'Sea' - 3: 'Land' - 9: 'Not reported')
11	us_state	U.S. State / Address of Immigrant Inside USA ('99'='All Other Codes') actually representing the final address of the migrants, that is where they currently live in the US.
12	departure_date	Departure Date from the USA
13	age	Age of Respondent in Years
14	visa_category	Visa codes collapsed into three categories (Business - Pleasure - Student)
15	dep_issued_visa	Department of State where where Visa was issued - CIC does not use This is where your visa was issued. It will be a U.S. embassy or U.S. consulate.
16	visa_expiration_date	Character Date Field - Date to which admitted to U.S. (allowed to stay until) - CIC does not use visa expiration date
17	gender	Non-immigrant sex
18	airline	Airline used to arrive in U.S.
19	admission_number	Admission Number - An 11-digit number assigned to an alien when he enters the Unites States.
20	flight_number	Flight number of Airline used to arrive in U.S.
21	visa_type	VISATYPE - Class of admission legally admitting the non-immigrant to temporarily stay in U.S.

Dimension Tables

1. date - to aggregate the data suing various time units
  |-- arrdate: date (nullable = true)
  |-- arrival_day: integer (nullable = true)
  |-- arrival_week: integer (nullable = true)
  |-- arrival_month: integer (nullable = true)
  |-- arrival_year: integer (nullable = true)
  |-- arrival_weekday: integer (nullable = true)
1. demographics - To look at the demographic data of the areas with the most travelers
  |-- City: string (nullable = true)
  |-- State: string (nullable = true)
  |-- median_age: double (nullable = true)
  |-- male_population: integer (nullable = true)
  |-- female_population: integer (nullable = true)
  |-- total_population: integer (nullable = true)
  |-- n_veterans: integer (nullable = true)
  |-- foreign_born: integer (nullable = true)
  |-- avg_household_size: double (nullable = true)
  |-- state_code: string (nullable = true)
  |-- Race: string (nullable = true)
  |-- Count: integer (nullable = true)

Why Spark?

Consdiering the significant size of the immigration dataset (~ 3 million rows) for only a month, the most sensible technology choice for such an approach would be spark, especially if we were to process data over a longer period of time.
Apache spark was used because of:
- it's ability to handle multiple file formats with large amounts of data.
- Apache Spark offers a lightning-fast unified analytics engine for big data.
- Spark has easy-to-use APIs for operating on large datasets

Other Scenarios

IF the data was increased by 100x

Spark can handle the increase but we would consider increasing the number of nodes in our cluster.
We would still use spark as it as our data processing platform since it is the best suited platform for very large datasets.
Our data would be stored in an Amazon S3 bucket (instead of storing it in the EMR cluster along with the staging tables) and loaded to our staging tables.

IF the data populates a dashboard that must be updated on a daily basis by 7am every day.

We would use Apache Airflow to schedule and run data pipelines.

If the database needed to be accessed by 100+ people:

We would move our analytics database into Amazon Redshift
Once the data is ready to be consumed, it would be stored in a postgres database on a redshift cluster that easily supports multiuser access.

pe-mn / US-Immigration