US Immigration Data Analysis - Shawarma Joint

Data Engineering Capstone Project

Project Summary

The project follows these steps:

Step 1: Scope the Project and Gather Data
Step 2: Explore and Assess the Data
Step 3: Define the Data Model
Step 4: Run ETL to Model the Data
Step 5: Complete Project Write Up

Step 1: Scope the Project and Gather Data

Business Case Research - Shawarma Joint

Avengers eating Shawarma: https://www.youtube.com/watch?v=EYiZeszLosE

Shawarma is one of the most popular Arabic street foods. The goal of this study is therefore to identify the US airports that receive the most visitors and students from Arabic countries, as candidate locations for opening a shawarma restaurant.

Data Sources

List of Arabic Countries
Population Data from the WorldBank
List of US Cities and their Lat/Lng coordinates
Airport Codes

Note: I imported several third-party data files to join with the immigration data for exploration purposes. Not all of these data sources are used in the final pipeline.

!pip install python-Levenshtein fuzzywuzzy pyshp geopandas


Importing Libraries and Creating Spark Session

# Do all imports and installs here
import pandas as pd
import fuzzywuzzy
import shapefile
from shapely.geometry.polygon import LinearRing, Polygon, LineString

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

import pyspark.sql.functions as F
from pyspark.sql.types import StructType as R, StructField as Fld, DoubleType as Dbl, StringType as Str
from pyspark.sql.types import IntegerType as Int, DateType as Date

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11")
    .enableHiveSupport()
    .getOrCreate()
)

#df_spark =spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_aug16_sub.sas7bdat')
sc = spark.sparkContext
spark.sparkContext.setLogLevel('ERROR')
spark.conf.set("spark.sql.shuffle.partitions", 20)

Step 2: Explore and Assess the Data

Reading and Exploring Temperature Data

fname = '../../data2/GlobalLandTemperaturesByCity.csv'
temp_df = pd.read_csv(fname)

round(temp_df[(temp_df['Country'] == 'United States') & (temp_df['City'] == 'Boston')].iloc[-12:]['AverageTemperature'].mean(), 1)

8.9000000000000004

Reading and Exploring Population Data

Downloaded from https://data.worldbank.org/indicator/SP.POP.TOTL?end=2018&start=2013

pop_sizes_df = pd.read_csv('WorldBankPopulationSizeByCountry.csv')
pop_sizes_df = pop_sizes_df.rename({'Country Name':'CountryName'}, axis=1)
country_names = list(pop_sizes_df['CountryName'])
pop_sizes_df = pop_sizes_df.set_index('CountryName')
pop_sizes_df.head()

from fuzzywuzzy import process 
from fuzzywuzzy import fuzz

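def findClosestCountryName(cn, cns=country_names, setOrSort=True, population=False):
    """Return the entry of `cns` that best fuzzy-matches `cn`, with its score.

    setOrSort=True scores with fuzz.token_set_ratio, False with fuzz.token_sort_ratio.
    With population=True, the 2016 population of the matched country is appended.
    """
    scorer = fuzz.token_set_ratio if setOrSort else fuzz.token_sort_ratio
    max_score = 0
    country = ''

    for n in cns:
        score = scorer(cn, n)
        if score > max_score:
            max_score = score
            country = n

    if population:
        return pd.Series([country, max_score, pop_sizes_df.loc[country]['2016']])
    return pd.Series([country, max_score])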

print(findClosestCountryName('MEXICO Air Sea, and Not Reported (I-94, no land arrivals)', country_names)[0])
print(findClosestCountryName('CHINA, PRC', country_names)[0])
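
The matching above scores every candidate name and keeps the best one. To illustrate the idea without the fuzzywuzzy dependency, here is a minimal standard-library sketch. It is not fuzzywuzzy's implementation: it approximates a token-sort comparison with `difflib.SequenceMatcher`, and the helper names `token_sort_score` and `closest_name` are hypothetical.

```python
import re
from difflib import SequenceMatcher


def _tokens(text):
    # Lowercase the text and keep only alphanumeric word tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def token_sort_score(a, b):
    """Similarity in [0, 100] after sorting tokens, so word order is ignored."""
    sa = " ".join(sorted(_tokens(a)))
    sb = " ".join(sorted(_tokens(b)))
    return round(100 * SequenceMatcher(None, sa, sb).ratio())


def closest_name(query, candidates):
    """Return (best_candidate, score); the first candidate wins ties."""
    best, best_score = "", 0
    for cand in candidates:
        score = token_sort_score(query, cand)
        if score > best_score:
            best, best_score = cand, score
    return best, best_score


names = ["Korea, Rep.", "United Kingdom", "Japan"]
print(closest_name("KINGDOM UNITED", names))
print(closest_name("JAPAN", names))
```

A sort-style score penalises extra words, so a long label such as the Mexico entry would not reach 100 against "Mexico" with this sketch. That is presumably why the set-based ratio is the default in `findClosestCountryName`.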

pop_sizes_df.head(3)

	Country Code	Indicator Name	Indicator Code	1960	1961	1962	1963	1964	1965	1966	1967	1968	1969	1970	1971	1972	1973	1974	1975	1976	1977	1978	1979	1980	1981	1982	1983	1984	1985	1986	1987	1988	1989	1990	1991	1992	1993	1994	1995	1996	1997	1998	1999	2000	2001	2002	2003	2004	2005	2006	2007	2008	2009	2010	2011	2012	2013	2014	2015	2016	2017	2018
CountryName
Aruba	ABW	Population, total	SP.POP.TOTL	54211.0	55438.0	56225.0	56695.0	57032.0	57360.0	57715.0	58055.0	58386.0	58726.0	59063.0	59440.0	59840.0	60243.0	60528.0	60657.0	60586.0	60366.0	60103.0	59980.0	60096.0	60567.0	61345.0	62201.0	62836.0	63026.0	62644.0	61833.0	61079.0	61032.0	62149.0	64622.0	68235.0	72504.0	76700.0	80324.0	83200.0	85451.0	87277.0	89005.0	90853.0	92898.0	94992.0	97017.0	98737.0	100031.0	100834.0	101222.0	101358.0	101455.0	101669.0	102046.0	102560.0	103159.0	103774.0	104341.0	104872.0	105366.0	105845.0
Afghanistan	AFG	Population, total	SP.POP.TOTL	8996973.0	9169410.0	9351441.0	9543205.0	9744781.0	9956320.0	10174836.0	10399926.0	10637063.0	10893776.0	11173642.0	11475445.0	11791215.0	12108963.0	12412950.0	12689160.0	12943093.0	13171306.0	13341198.0	13411056.0	13356511.0	13171673.0	12882528.0	12537730.0	12204292.0	11938208.0	11736179.0	11604534.0	11618005.0	11868877.0	12412308.0	13299017.0	14485546.0	15816603.0	17075727.0	18110657.0	18853437.0	19357126.0	19737765.0	20170844.0	20779953.0	21606988.0	22600770.0	23680871.0	24726684.0	25654277.0	26433049.0	27100536.0	27722276.0	28394813.0	29185507.0	30117413.0	31161376.0	32269589.0	33370794.0	34413603.0	35383128.0	36296400.0	37172386.0
Angola	AGO	Population, total	SP.POP.TOTL	5454933.0	5531472.0	5608539.0	5679458.0	5735044.0	5770570.0	5781214.0	5774243.0	5771652.0	5803254.0	5890365.0	6040777.0	6248552.0	6496962.0	6761380.0	7024000.0	7279509.0	7533735.0	7790707.0	8058067.0	8341289.0	8640446.0	8952950.0	9278096.0	9614754.0	9961997.0	10320111.0	10689250.0	11068050.0	11454777.0	11848386.0	12248901.0	12657366.0	13075049.0	13503747.0	13945206.0	14400719.0	14871570.0	15359601.0	15866869.0	16395473.0	16945753.0	17519417.0	18121479.0	18758145.0	19433602.0	20149901.0	20905363.0	21695634.0	22514281.0	23356246.0	24220661.0	25107931.0	26015780.0	26941779.0	27884381.0	28842484.0	29816748.0	30809762.0

Reading in US Cities Location Data

Reading and Exploring Airport Code Data

airport_codes = pd.read_csv('airport-codes_csv.csv')
airport_codes = airport_codes.fillna('')
airport_codes = airport_codes[(airport_codes['iso_country'] == 'US') & (airport_codes['type'] == 'large_airport')]
airport_codes['municipality'] = airport_codes['municipality'].apply(lambda x: x.lower())
airport_codes[airport_codes['municipality'] == 'orlando'].head()

	ident	type	name	elevation_ft	continent	iso_country	iso_region	municipality	gps_code	iata_code	local_code	coordinates
28001	KMCO	large_airport	Orlando International Airport	96		US	US-FL	orlando	KMCO	MCO	MCO	-81.30899810791016, 28.429399490356445
29937	KSFB	large_airport	Orlando Sanford International Airport	55		US	US-FL	orlando	KSFB	SFB	SFB	-81.23750305175781, 28.777599334716797

len(airport_codes)

Reading and Exploring Demographics Data

Reference: https://simplemaps.com/data/us-cities

us_cities = pd.read_csv('uscities.csv')
us_cities.head()

	city	city_ascii	state_id	state_name	county_fips	county_name	county_fips_all	county_name_all	lat	lng	population	density	source	military	incorporated	timezone	ranking	zips	id
0	South Creek	South Creek	WA	Washington	53053	Pierce	53053	Pierce	46.9994	-122.3921	2500.0	125.0	polygon	False	True	America/Los_Angeles	3	98580 98387 98338	1840116412
1	Roslyn	Roslyn	WA	Washington	53037	Kittitas	53037	Kittitas	47.2507	-121.0989	947.0	84.0	polygon	False	True	America/Los_Angeles	3	98941 98068 98925	1840097718
2	Sprague	Sprague	WA	Washington	53043	Lincoln	53043	Lincoln	47.3048	-117.9713	441.0	163.0	polygon	False	True	America/Los_Angeles	3	99032	1840096300
3	Gig Harbor	Gig Harbor	WA	Washington	53053	Pierce	53053	Pierce	47.3352	-122.5968	9507.0	622.0	polygon	False	True	America/Los_Angeles	3	98332 98335	1840097082
4	Lake Cassidy	Lake Cassidy	WA	Washington	53061	Snohomish	53061	Snohomish	48.0639	-122.0920	3591.0	131.0	polygon	False	True	America/Los_Angeles	3	98223 98258 98270	1840116371

demog = pd.read_csv('us-cities-demographics.csv', delimiter=';')
demog.head()

	City	State	Median Age	Male Population	Female Population	Total Population	Number of Veterans	Foreign-born	Average Household Size	State Code	Race	Count
0	Silver Spring	Maryland	33.8	40601.0	41862.0	82463	1562.0	30908.0	2.60	MD	Hispanic or Latino	25924
1	Quincy	Massachusetts	41.0	44129.0	49500.0	93629	4147.0	32935.0	2.39	MA	White	58723
2	Hoover	Alabama	38.5	38040.0	46799.0	84839	4819.0	8229.0	2.58	AL	Asian	4759
3	Rancho Cucamonga	California	34.5	88127.0	87105.0	175232	5821.0	33878.0	3.18	CA	Black or African-American	24437
4	Newark	New Jersey	34.6	138040.0	143873.0	281913	5829.0	86253.0	2.73	NJ	White	76402

Determining the largest minority group in each state (it is not yet clear where this will be useful)

# Total count per state and race, excluding 'White', largest group first within each state
demog_race = demog[demog['Race'] != 'White']\
    .groupby(['State Code', 'Race'], as_index=False)['Count'].sum()\
    .sort_values(['State Code', 'Count'], ascending=[True, False])

# The first row per state is its largest minority group
biggest_minority = demog_race.drop_duplicates('State Code')\
    .rename(columns={'Race': 'Minority'})\
    .set_index('State Code')
biggest_minority.sort_values('Count', ascending=False).head()

	Minority	Count
State Code
CA	Hispanic or Latino	9856464
TX	Hispanic or Latino	6311431
NY	Hispanic or Latino	2730185
FL	Hispanic or Latino	1942022
AZ	Hispanic or Latino	1508157

Reading in Immigration Data

# Read in the data here: build the full path of every monthly SAS file
import os

data_dir = "../../data/18-83510-I94-Data-2016/"
files = [data_dir + f for f in os.listdir(data_dir)]
files

# Load each monthly file into its own Spark DataFrame
dfs = [spark.read.format('com.github.saurfang.sas.spark').load(f) for f in files]

Noticing that one of the months has 6 extra columns

dfs[4].limit(5).toPandas().columns

Index(['cicid', 'i94yr', 'i94mon', 'i94cit', 'i94res', 'i94port', 'arrdate', 'i94mode', 'i94addr', 'depdate', 'i94bir', 'i94visa', 'count', 'validres', 'delete_days', 'delete_mexl', 'delete_dup', 'delete_visa', 'delete_recdup', 'dtadfile', 'visapost', 'occup', 'entdepa', 'entdepd', 'entdepu', 'matflag', 'biryear', 'dtaddto', 'gender', 'insnum', 'airline', 'admnum', 'fltno', 'visatype'], dtype='object')

dfs[0].limit(5).toPandas().columns

Index(['cicid', 'i94yr', 'i94mon', 'i94cit', 'i94res', 'i94port', 'arrdate',
       'i94mode', 'i94addr', 'depdate', 'i94bir', 'i94visa', 'count',
       'dtadfile', 'visapost', 'occup', 'entdepa', 'entdepd', 'entdepu',
       'matflag', 'biryear', 'dtaddto', 'gender', 'insnum', 'airline',
       'admnum', 'fltno', 'visatype'],
      dtype='object')

cols = ['delete_days', 'delete_mexl', 'delete_dup', 'delete_visa', 'delete_recdup']

Evaluating the information in those 6 extra columns

display(dfs[4].limit(5).toPandas().head())
display(dfs[4].filter('validres != 1').limit(5).toPandas().head())

for c in cols:
    display(dfs[4].filter(c +' != 0').limit(5).toPandas().head())

	cicid	i94yr	i94mon	i94cit	i94res	i94port	arrdate	i94mode	i94addr	depdate	i94bir	i94visa	count	validres	dtadfile	visapost	occup	entdepa	entdepd	entdepu	matflag	biryear	dtaddto	gender	insnum	airline	admnum	fltno	visatype
0	4.0	2016.0	6.0	135.0	135.0	XXX	20612.0	None	None	None	59.0	2.0	1.0	1.0	None	None	None	Z	None	U	None	1957.0	10032016	None	None	None	1.493846e+10	None	WT
1	5.0	2016.0	6.0	135.0	135.0	XXX	20612.0	None	None	None	50.0	2.0	1.0	1.0	None	None	None	Z	None	U	None	1966.0	10032016	None	None	None	1.746006e+10	None	WT
2	6.0	2016.0	6.0	213.0	213.0	XXX	20609.0	None	None	None	27.0	3.0	1.0	1.0	None	None	None	T	None	U	None	1989.0	D/S	None	None	None	1.679298e+09	None	F1
3	7.0	2016.0	6.0	213.0	213.0	XXX	20611.0	None	None	None	23.0	3.0	1.0	1.0	None	None	None	T	None	U	None	1993.0	D/S	None	None	None	1.140963e+09	None	F1
4	16.0	2016.0	6.0	245.0	245.0	XXX	20632.0	None	None	None	24.0	3.0	1.0	1.0	None	None	None	T	None	U	None	1992.0	D/S	None	None	None	1.934535e+09	None	F1

	cicid	i94yr	i94mon	i94cit	i94res	i94port	arrdate	i94mode	i94addr	depdate	i94bir	i94visa	count	validres	delete_days	delete_mexl	delete_dup	delete_visa	delete_recdup	dtadfile	visapost	occup	entdepa	entdepd	entdepu	matflag	biryear	dtaddto	gender	insnum	airline	admnum	fltno	visatype

(All six filters returned this same empty result: the header with no rows.)

Dropping the extra columns: none of the filters returned any rows, so these columns don't contain any valuable information

dfs[4] = dfs[4].drop('validres')
for c in cols:
    dfs[4] = dfs[4].drop(c)
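
The month with extra columns was found by inspecting `dfs[4]` by hand. As a hedged sketch of how that check could be automated, the helper below compares two column lists and reports what the second has in addition. The name `extra_columns` is hypothetical, and the column lists are shortened for illustration; they are not the full I-94 schema.

```python
def extra_columns(reference, other):
    """Columns present in `other` but not in `reference`, in original order."""
    ref = set(reference)
    return [c for c in other if c not in ref]


# Shortened, illustrative column lists (not the full I-94 schema).
jan_cols = ["cicid", "i94yr", "i94mon", "count", "dtadfile"]
jun_cols = ["cicid", "i94yr", "i94mon", "count", "validres",
            "delete_days", "dtadfile"]

print(extra_columns(jan_cols, jun_cols))  # ['validres', 'delete_days']
```

With the Spark DataFrames loaded earlier, the same helper could be applied to `dfs[0].columns` and each `df.columns` in turn, which avoids depending on the file order returned by `os.listdir`.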

Joining all immigration data months

imm_df = dfs[0]

for i in range(1, len(dfs)):
    imm_df = imm_df.union(dfs[i])

imm_df.count()

40790529

Writing to Parquet

#write to parquet
imm_df.write.parquet("proc_sas_data")

Spark Processing Checkpoint: Read all processed Immigration Data

imm_df = spark.read.parquet("proc_sas_data")

Exploring the Data through grouping by various dimensions

top_ports = imm_df.groupby(['i94port']).count().orderBy(F.col('count'), ascending=False)
top_ports.coalesce(1).write.mode('overwrite').csv('ports.csv')
top_ports.show(5)

+-------+-------+
|i94port|  count|
+-------+-------+
|    NYC|6678555|
|    MIA|5122889|
|    LOS|4602847|
|    SFR|2309621|
|    HHW|2249967|
+-------+-------+
only showing top 5 rows

top_res = imm_df.groupby(['i94res']).count().orderBy(F.col('count'), ascending=False)
top_res.coalesce(1).write.mode('overwrite').csv('res.csv')
top_res.show(5)

+------+-------+
|i94res|  count|
+------+-------+
| 135.0|4587092|
| 209.0|3603786|
| 245.0|3049942|
| 582.0|2661125|
| 112.0|2046288|
+------+-------+
only showing top 5 rows

top_addr = imm_df.groupby(['i94addr']).count().orderBy(F.col('count'), ascending=False)
top_addr.coalesce(1).write.mode('overwrite').csv('addr.csv')
top_addr.show(5)

+-------+-------+
|i94addr|  count|
+-------+-------+
|     FL|8156192|
|     NY|6764396|
|     CA|6531491|
|     HI|2338444|
|   null|2027926|
+-------+-------+
only showing top 5 rows

top_mode = imm_df.groupby(['i94mode']).count().orderBy(F.col('count'), ascending=False)
top_mode.coalesce(1).write.mode('overwrite').csv('mode.csv')
top_mode.show(5)

top_visa = imm_df.groupby(['i94visa']).count().orderBy(F.col('count'), ascending=False)
top_visa.coalesce(1).write.mode('overwrite').csv('visa.csv')
top_visa.show(5)

+-------+--------+
|i94visa|   count|
+-------+--------+
|    2.0|33641979|
|    1.0| 5575279|
|    3.0| 1573271|
+-------+--------+

top_cit = imm_df.groupby(['i94cit']).count().orderBy(F.col('count'), ascending=False)
top_cit.coalesce(1).write.mode('overwrite').csv('countries.csv')
top_cit.show(5)

+------+-------+
|i94cit|  count|
+------+-------+
| 135.0|4531534|
| 209.0|3278033|
| 245.0|3128257|
| 582.0|2617070|
| 148.0|2051390|
+------+-------+
only showing top 5 rows

top_cit.join(imm_country_df, top_cit.i94cit == imm_country_df.Code).orderBy('count', ascending=False).show(10)

+------+-------+----+--------------------+--------------+-----+-------------+
|i94cit|  count|Code|             Country|   CountryName|Score|   Population|
+------+-------+----+--------------------+--------------+-----+-------------+
| 135.0|4531534| 135|      UNITED KINGDOM|United Kingdom|  100|  6.5595565E7|
| 209.0|3278033| 209|               JAPAN|         Japan|  100| 1.26994511E8|
| 245.0|3128257| 245|          CHINA, PRC|         China|  100|   1.378665E9|
| 582.0|2617070| 582|MEXICO Air Sea, a...|        Mexico|  100| 1.23333376E8|
| 111.0|1679312| 111|              FRANCE|        France|  100|  6.6859768E7|
| 689.0|1672212| 689|              BRAZIL|        Brazil|  100| 2.06163058E8|
| 438.0|1325861| 438|           AUSTRALIA|     Australia|  100|  2.4190907E7|
| 213.0|1252212| 213|               INDIA|         India|  100|1.324509589E9|
| 117.0|1116790| 117|               ITALY|         Italy|  100|  6.0627498E7|
| 129.0| 895509| 129|               SPAIN|         Spain|  100|  4.6483569E7|
+------+-------+----+--------------------+--------------+-----+-------------+
only showing top 10 rows

Step 3: Define the Data Model

3.1 Conceptual Data Model

The data model tracks visitors of Arab nationalities by country and by port of entry. To build it, we first group all visitors by nationality and port of entry. I have also grouped by country of residence and visa type, in case they become useful later. The grouped visitors DataFrame is then joined with the country information and with the port-of-entry information. The final output is the list of ports of entry with the count of Arab nationals arriving at each, together with the city name, state, and GPS coordinates of the city.
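
The shape of this model (filter by nationality, count per port, then join with port details) can be sketched in plain Python. This is only an illustration of the idea, not the Spark pipeline used in Step 4: the arrival rows are made up, and the lookup tables are tiny stand-ins for the country and port dictionaries.

```python
from collections import Counter

# Hypothetical sample rows: (citizenship code, port code). The rows are made up.
arrivals = [(261, "NYC"), (261, "LOS"), (368, "NYC"), (135, "NYC"), (272, "NYC")]

# Small lookup tables standing in for the country and port dictionaries.
arab_codes = {261: "Saudi Arabia", 368: "Egypt", 272: "Kuwait"}
ports = {
    "NYC": ("New York", "NY", 40.6943, -73.9249),
    "LOS": ("Los Angeles", "CA", 34.1139, -118.4068),
}

# 1. Keep Arab nationalities only, 2. count arrivals per port of entry.
per_port = Counter(port for cit, port in arrivals if cit in arab_codes)

# 3. Join each port code with its city, state, and coordinates, highest count first.
report = [(port, *ports[port], n) for port, n in per_port.most_common()]
print(report)
```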

3.2 Mapping Out Data Pipelines

Process immigration dictionary to help figure out immigration column fields
Process list of Arab countries
Start processing immigration DataFrame and join relevant data

Step 4: Run Pipelines to Model the Data

Process Immigration Dictionary

The contents of the I94_SAS_Labels_Descriptions.SAS file were copied into an Excel workbook (imm_dictionary.xlsx), with one sheet per field dictionary. The relevant field dictionaries are processed below.

imm_dict_country = pd.read_excel('imm_dictionary.xlsx', 'Country', header=None)
imm_dict_country[['Code', 'Country']] = imm_dict_country[0].apply(lambda x: pd.Series(x.strip().replace("'", "")\
                                                                                       .split("=")))
imm_dict_country['Code'] = imm_dict_country['Code'].apply(lambda x: int(x.strip()))
imm_dict_country['Country'] = imm_dict_country['Country'].apply(lambda x: x.strip())
imm_dict_country[['CountryName', 'Score', 'Population']] = imm_dict_country['Country'].apply(lambda x: findClosestCountryName(x, 
                                                                                                    country_names, population=True))
imm_dict_country = imm_dict_country.drop(0, axis=1)
imm_country_df = spark.createDataFrame(imm_dict_country)

imm_dict_country.head()

	Code	Country	CountryName	Score	Population
0	582	MEXICO Air Sea, and Not Reported (I-94, no lan...	Mexico	100	123333376.0
1	236	AFGHANISTAN	Afghanistan	100	35383128.0
2	101	ALBANIA	Albania	100	2876101.0
3	316	ALGERIA	Algeria	100	40551404.0
4	102	ANDORRA	Andorra	100	77297.0
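
The country dictionary is built by splitting each label line into a numeric code and a name. As a minimal sketch of that parsing step on its own, assuming each line has the layout `code = 'label'`, the hypothetical helper `parse_label_line` below splits on the first `=` only, so a label that happens to contain `=` would not break the split.

```python
def parse_label_line(line):
    """Split a line such as "582 = 'MEXICO'" into (582, 'MEXICO')."""
    code, _, label = line.partition("=")
    return int(code.strip()), label.strip().strip("'").strip()


# Illustrative lines in the assumed "code = 'label'" layout.
sample = [
    "   582 =  'MEXICO Air Sea, and Not Reported (I-94, no land arrivals)'",
    "   236 =  'AFGHANISTAN'",
]
codes = dict(parse_label_line(line) for line in sample)
print(codes[236])  # AFGHANISTAN
```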

city_names = list(us_cities['city_ascii'])
findClosestCountryName('Alcan', city_names)[0]

'Alcan Border'

imm_dict_port = pd.read_excel('imm_dictionary.xlsx', 'Port', header=None)
imm_dict_port['Code'] = imm_dict_port[0].apply(lambda x: x.strip().replace("'", ""))
imm_dict_port['Port'] = imm_dict_port[2].apply(lambda x: x.strip().replace("'", ""))

def splitCityState(x):
    """Split 'CITY, ST' into [city, state]; any further comma-separated parts stay in the state."""
    parts = [p.strip() for p in x.strip().split(',')]

    if len(parts) < 2:
        return pd.Series([parts[0], ''])
    return pd.Series([parts[0], ', '.join(parts[1:])])

imm_dict_port[['City', 'State']] = imm_dict_port['Port'].apply(lambda x: splitCityState(x))
imm_dict_port = imm_dict_port.drop([0, 1, 2, 3], axis=1)
imm_dict_port['City'] = imm_dict_port['City'].apply(lambda x: x.capitalize())

def identify_city_name(row):
    """Fuzzy-match a port's city name against the cities of the same state only."""
    state_cities = list(us_cities[us_cities['state_id'] == row['State']]['city_ascii'])
    return findClosestCountryName(row['City'], state_cities, setOrSort=False)[0]


imm_dict_port['matched_city'] = imm_dict_port.apply(identify_city_name, axis=1)
imm_dict_port = imm_dict_port.merge(us_cities,left_on=['matched_city', 'State'], right_on=['city_ascii', 'state_id'])
imm_dict_port = imm_dict_port.drop(['City'], axis=1)
print(len(imm_dict_port))
imm_dict_port = imm_dict_port.fillna('')
imm_port_df = spark.createDataFrame(imm_dict_port)
#imm_port_df.limit(50).toPandas().head()
#imm_dict_port[imm_dict_port['city_ascii'].isnull()].head()
imm_dict_port.head()

	Code	Port	State	matched_city	city	city_ascii	state_id	state_name	county_fips	county_name	county_fips_all	county_name_all	lat	lng	population	density	source	military	incorporated	timezone	ranking	zips	id
0	ALC	ALCAN, AK	AK	Alatna	Alatna	Alatna	AK	Alaska	2290	Yukon-Koyukuk	02290	Yukon-Koyukuk	66.5638	-152.8392	0.0	0.0	polygon	False	False	America/Anchorage	3	99720	1840114044
1	ANC	ANCHORAGE, AK	AK	Anchorage	Anchorage	Anchorage	AK	Alaska	2020	Anchorage	02020	Anchorage	61.1508	-149.1091	253421.0	66.0	polygon	False	True	America/Anchorage	2	99518 99515 99517 99516 99513 99540 99567 9958...	1840089974
2	BAR	BAKER AAF - BAKER ISLAND, AK	AK	Point Baker	Point Baker	Point Baker	AK	Alaska	2198	Prince of Wales-Hyder	02198	Prince of Wales-Hyder	56.3482	-133.6167	22.0	9.0	polygon	False	False	America/Sitka	3	99927	1840114092
3	DAC	DALTONS CACHE, AK	AK	Nondalton	Nondalton	Nondalton	AK	Alaska	2164	Lake and Peninsula	02164	Lake and Peninsula	59.9711	-154.8626	132.0	7.0	polygon	False	True	America/Anchorage	3	99640	1840090141
4	PIZ	DEW STATION PT LAY DEW, AK	AK	Attu Station	Attu Station	Attu Station	AK	Alaska	2016	Aleutians West	02016	Aleutians West	52.8955	173.1230	16.0	0.0	polygon	False	True	America/Adak	3		1840114050

imm_dict_states = pd.read_excel('imm_dictionary.xlsx', 'States', header=None)
imm_dict_states[['Code', 'State']] = imm_dict_states[0].apply(lambda x: pd.Series(x.strip().replace("'", "")\
                                                                                       .split("=")))
imm_dict_states = imm_dict_states.drop(0, axis=1)
imm_state_df = spark.createDataFrame(imm_dict_states)

imm_dict_states.head()

	Code	State
0	AL	ALABAMA
1	AK	ALASKA
2	AZ	ARIZONA
3	AR	ARKANSAS
4	CA	CALIFORNIA

Reading in List of Arabic Countries

Reference: https://www.downloadexcelfiles.com/wo_en/download-excel-file-list-arab-countries#.XZm_fuczZTY

arabic_countries = pd.read_csv('list-arab-countries-439j.csv')
arabic_countries = list(arabic_countries['Country (or dependent territory)'])

imm_countries = list(set(imm_dict_country['Country']))

arabic_countries_dict = {}

for c in arabic_countries: 
    match = findClosestCountryName(c, imm_countries, setOrSort=False)[0]
    print(c, match, imm_dict_country[imm_dict_country['Country'] == match]['Code'].values[0])
    arabic_countries_dict[c] = imm_dict_country[imm_dict_country['Country'] == match]['Code'].values[0]

Egypt EGYPT 368
Algeria ALGERIA 316
Iraq IRAQ 250
Sudan SUDAN 350
Morocco MOROCCO 332
Saudi Arabia SAUDI ARABIA 261
Yemen YEMEN 216
Syria SYRIA 262
Tunisia TUNISIA 323
Somalia SOMALIA 397
United Arab Emirates UNITED ARAB EMIRATES 296
Jordan JORDAN 253
Libya LIBYA 381
Palestine PALESTINE 743
Lebanon LEBANON 255
Oman OMAN 256
Kuwait KUWAIT 272
Mauritania MAURITANIA 389
Qatar QATAR 297
Bahrain BAHRAIN 298
Djibouti DJIBOUTI 322
Comoros COMOROS 317

visitors = imm_df.groupby(['i94cit', 'i94res', 'I94PORT', 'i94visa']).count()
visitors = visitors.join(imm_country_df, visitors['i94cit'] == imm_country_df['Code'])\
                .selectExpr('*', "CountryName as CitCountry")\
                .selectExpr('*', "Population as CitPopulation").drop('CountryName').drop('Code')\
                    .drop('Population').drop('Score').drop('Country')
visitors = visitors.join(imm_country_df, visitors['i94res'] == imm_country_df['Code'])\
                .selectExpr('*', 'CountryName as ResCountry')\
                .selectExpr('*', "Population as ResPopulation").drop('Country')\
                .drop('Population').drop('CountryName').drop('Code').drop('Score')

visitors = visitors.join(imm_port_df, visitors['i94port'] == imm_port_df['Code']).drop('key_0').drop('Code')
visitors = visitors.orderBy(F.col('count'), ascending=False)
visitors.coalesce(1).write.mode('overwrite').csv('visitors.csv')
visitors.limit(15).toPandas().head()

	i94cit	i94res	I94PORT	i94visa	count	CitCountry	CitPopulation	ResCountry	ResPopulation	Port	State	matched_city	city	city_ascii	state_id	state_name	county_fips	county_name	county_fips_all	county_name_all	lat	lng	population	density	source	military	incorporated	timezone	ranking	zips	id
0	209.0	209.0	HHW	2.0	1429900	Japan	126994511.0	Japan	126994511.0	HONOLULU, HI	HI	Honolulu	Honolulu	Honolulu	HI	Hawaii	15003	Honolulu	15003	Honolulu	21.3294	-157.8460	833671.0	2234.0	polygon	False	True	Pacific/Honolulu	2	96859 96850 96822 96826 96813 96815 96814 9681...	1840118304
1	135.0	135.0	NYC	2.0	773579	United Kingdom	65595565.0	United Kingdom	65595565.0	NEW YORK, NY	NY	New York	New York	New York	NY	New York	36061	New York	36061	New York	40.6943	-73.9249	19354922.0	11083.0	polygon	False	True	America/New_York	1	11229 11226 11225 11224 11222 11221 11220 1138...	1840059961
2	135.0	135.0	ORL	2.0	630968	United Kingdom	65595565.0	United Kingdom	65595565.0	ORLANDO, FL	FL	Orlando	Orlando	Orlando	FL	Florida	12095	Orange	12095	Orange	28.4772	-81.3369	1776841.0	982.0	polygon	False	True	America/New_York	1	32829 32827 32824 32822 32804 32805 32806 3280...	1840012172
3	689.0	689.0	MIA	2.0	536911	Brazil	206163058.0	Brazil	206163058.0	MIAMI, FL	FL	Miami	Miami	Miami	FL	Florida	12086	Miami-Dade	12086	Miami-Dade	25.7839	-80.2102	6381966.0	4969.0	polygon	False	True	America/New_York	1	33129 33125 33126 33127 33128 33149 33144 3314...	1840012834
4	438.0	438.0	LOS	2.0	499750	Australia	24190907.0	Australia	24190907.0	LOS ANGELES, CA	CA	Los Angeles	Los Angeles	Los Angeles	CA	California	6037	Los Angeles	06037	Los Angeles	34.1139	-118.4068	12815475.0	3295.0	polygon	False	True	America/Los_Angeles	1	90291 90293 90292 91316 91311 90037 90031 9000...	1840107920

visitors[visitors['CitCountry'] == 'Saudi Arabia'].orderBy(F.col('count'), ascending=False).limit(15).toPandas().head()

	i94cit	i94res	I94PORT	i94visa	count	CitCountry	CitPopulation	ResCountry	ResPopulation	Port	State	matched_city	city	city_ascii	state_id	state_name	county_fips	county_name	county_fips_all	county_name_all	lat	lng	population	density	source	military	incorporated	timezone	ranking	zips	id
0	261.0	261.0	NYC	2.0	31515	Saudi Arabia	32442572.0	Saudi Arabia	32442572.0	NEW YORK, NY	NY	New York	New York	New York	NY	New York	36061	New York	36061	New York	40.6943	-73.9249	19354922.0	11083.0	polygon	False	True	America/New_York	1	11229 11226 11225 11224 11222 11221 11220 1138...	1840059961
1	261.0	261.0	LOS	2.0	28888	Saudi Arabia	32442572.0	Saudi Arabia	32442572.0	LOS ANGELES, CA	CA	Los Angeles	Los Angeles	Los Angeles	CA	California	6037	Los Angeles	06037	Los Angeles	34.1139	-118.4068	12815475.0	3295.0	polygon	False	True	America/Los_Angeles	1	90291 90293 90292 91316 91311 90037 90031 9000...	1840107920
2	261.0	261.0	NYC	3.0	16913	Saudi Arabia	32442572.0	Saudi Arabia	32442572.0	NEW YORK, NY	NY	New York	New York	New York	NY	New York	36061	New York	36061	New York	40.6943	-73.9249	19354922.0	11083.0	polygon	False	True	America/New_York	1	11229 11226 11225 11224 11222 11221 11220 1138...	1840059961
3	261.0	261.0	CHI	3.0	14948	Saudi Arabia	32442572.0	Saudi Arabia	32442572.0	CHICAGO, IL	IL	Chicago	Chicago	Chicago	IL	Illinois	17031	Cook	17031	Cook	41.8373	-87.6862	8675982.0	4612.0	polygon	False	True	America/Chicago	1	60018 60649 60641 60640 60643 60642 60645 6064...	1840021521
4	261.0	261.0	LOS	3.0	13102	Saudi Arabia	32442572.0	Saudi Arabia	32442572.0	LOS ANGELES, CA	CA	Los Angeles	Los Angeles	Los Angeles	CA	California	6037	Los Angeles	06037	Los Angeles	34.1139	-118.4068	12815475.0	3295.0	polygon	False	True	America/Los_Angeles	1	90291 90293 90292 91316 91311 90037 90031 9000...	1840107920

vis_df = visitors.toPandas()

# Keep only the rows whose citizenship code belongs to an Arab country
ac_vis_df = vis_df[vis_df['i94cit'].isin(list(arabic_countries_dict.values()))]
ac_vis_df.head(5)

	i94cit	i94res	I94PORT	i94visa	count	CitCountry	CitPopulation	ResCountry	ResPopulation	Port	State	matched_city	city	city_ascii	state_id	state_name	county_fips	county_name	county_fips_all	county_name_all	lat	lng	population	density	source	military	incorporated	timezone	ranking	zips	id
194	261.0	261.0	NYC	2.0	31515	Saudi Arabia	32442572.0	Saudi Arabia	32442572.0	NEW YORK, NY	NY	New York	New York	New York	NY	New York	36061	New York	36061	New York	40.6943	-73.9249	19354922.0	11083.0	polygon	False	True	America/New_York	1	11229 11226 11225 11224 11222 11221 11220 1138...	1840059961
203	261.0	261.0	LOS	2.0	28888	Saudi Arabia	32442572.0	Saudi Arabia	32442572.0	LOS ANGELES, CA	CA	Los Angeles	Los Angeles	Los Angeles	CA	California	6037	Los Angeles	06037	Los Angeles	34.1139	-118.4068	12815475.0	3295.0	polygon	False	True	America/Los_Angeles	1	90291 90293 90292 91316 91311 90037 90031 9000...	1840107920
217	368.0	368.0	NYC	2.0	26875	Egypt, Arab Rep.	94447072.0	Egypt, Arab Rep.	94447072.0	NEW YORK, NY	NY	New York	New York	New York	NY	New York	36061	New York	36061	New York	40.6943	-73.9249	19354922.0	11083.0	polygon	False	True	America/New_York	1	11229 11226 11225 11224 11222 11221 11220 1138...	1840059961
315	272.0	272.0	NYC	2.0	18534	Kuwait	3956873.0	Kuwait	3956873.0	NEW YORK, NY	NY	New York	New York	New York	NY	New York	36061	New York	36061	New York	40.6943	-73.9249	19354922.0	11083.0	polygon	False	True	America/New_York	1	11229 11226 11225 11224 11222 11221 11220 1138...	1840059961
328	261.0	261.0	NYC	3.0	16913	Saudi Arabia	32442572.0	Saudi Arabia	32442572.0	NEW YORK, NY	NY	New York	New York	New York	NY	New York	36061	New York	36061	New York	40.6943	-73.9249	19354922.0	11083.0	polygon	False	True	America/New_York	1	11229 11226 11225 11224 11222 11221 11220 1138...	1840059961

len(ac_vis_df)

from shapely.ops import unary_union

sf = shapefile.Reader('./us-shapefile/gz_2010_us_040_00_500k.shp')
rows = []

for s in sf.shapeRecords():
    # __geo_interface__ exposes each record as a GeoJSON-like dict
    geo = s.__geo_interface__
    sp = geo['properties']
    sg = geo['geometry']

    if sg['type'] == 'MultiPolygon':
        # Keep the exterior ring of each part, then merge the parts into one shape
        polygons = [Polygon(list(p[0])) for p in sg['coordinates']]
        state_pol = unary_union(polygons)
    else:
        state_pol = Polygon(sg['coordinates'][0])

    rows.append({'Name': sp['NAME'], 'Shape': state_pol})

# Build the DataFrame once instead of appending row by row
us_states_geo_df = pd.DataFrame(rows, columns=['Name', 'Shape'])
us_states_geo_df.head()

	Name	Shape
0	Maine	(POLYGON ((-70.6078338623047 42.9777641296387,...
1	Massachusetts	(POLYGON ((-70.81141662597659 41.249870300293,...
2	Michigan	(POLYGON ((-83.8292236328125 43.6626319885254,...
3	Montana	POLYGON ((-104.057698 44.997431, -104.250145 4...
4	Nevada	POLYGON ((-114.0506 37.00039599999999, -114.04...

state_counts = ac_vis_df.groupby('State').agg({'count':'sum'}).sort_values('count', ascending=False)
state_counts = state_counts.reset_index()
state_counts = state_counts.rename({'State':'State_Code', 'count':'StateCount'}, axis =1)
state_counts.head(5)

	State_Code	StateCount
0	NY	180561
1	CA	122553
2	IL	62470
3	FL	46369
4	TX	40002

hi_vol_ports = ac_vis_df.groupby(['I94PORT', 'city', 'State', 'state_name', 'lat', 'lng']).agg({'count':'sum'}).sort_values('count', ascending=False)
hi_vol_ports = hi_vol_ports.reset_index()
hi_vol_ports = hi_vol_ports.merge(us_states_geo_df, left_on='state_name', right_on='Name')
hi_vol_ports = hi_vol_ports.merge(state_counts, left_on='State', right_on='State_Code')
hi_vol_ports = hi_vol_ports.drop(['State_Code', 'state_name'], axis=1)
hi_vol_ports.to_csv('hi_vol_ports.csv')
hi_vol_ports.head()

	I94PORT	city	State	lat	lng	count	Name	Shape	StateCount
0	NYC	New York	NY	40.6943	-73.9249	170329	New York	(POLYGON ((-71.943563 41.286675, -71.926802380...	180561
1	CHM	Champlain	NY	44.9882	-73.4408	4271	New York	(POLYGON ((-71.943563 41.286675, -71.926802380...	180561
2	PBB	Central Bridge	NY	42.7068	-74.3473	1812	New York	(POLYGON ((-71.943563 41.286675, -71.926802380...	180561
3	NIA	Niagara Falls	NY	43.0921	-79.0147	1706	New York	(POLYGON ((-71.943563 41.286675, -71.926802380...	180561
4	LEW	Lewiston	NY	43.1724	-79.0400	1095	New York	(POLYGON ((-71.943563 41.286675, -71.926802380...	180561

Conclusion - Insight

Given the geography, it makes sense that NYC is the port of entry with the highest volume of Arab nationals. The next step is therefore to find the international airports in NYC and pick the busiest one. I94PORT is the CBP location code used for immigration purposes, and a single code may unfortunately cover several international airports.

airport_codes[airport_codes['municipality'] == 'new york'].head()

	ident	type	name	elevation_ft	iso_country	iso_region	municipality	gps_code	iata_code	local_code	coordinates
27679	KJFK	large_airport	John F Kennedy International Airport	13	US	US-NY	new york	KJFK	JFK	JFK	-73.77890015, 40.63980103
27819	KLGA	large_airport	La Guardia Airport	21	US	US-NY	new york	KLGA	LGA	LGA	-73.87259674, 40.77719879
49898	US-0883	large_airport	JFK		US	US-NY	new york				0, 0

From the above, we can infer that two airports fall under the NYC I94PORT port of entry; the third row appears to be a duplicate JFK record with placeholder coordinates. A quick online search shows that JFK handles more international flights, while La Guardia is geared mainly towards domestic flights.
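Because an I94 port code is tied to a city rather than to a single airport, one way to cross-check the candidates is to measure how far each airport lies from the port city's coordinates. The sketch below is not part of the original notebook: `haversine_km`, `parse_lng_lat`, and `airports_near` are hypothetical helper names, and the 50 km radius is an arbitrary assumption. It uses only values shown in the tables above, where the `coordinates` column appears to be stored as "longitude, latitude".

```python
import math

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two lat/lng points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def parse_lng_lat(coordinates):
    """Split an airport 'coordinates' string ('lng, lat') into (lat, lng) floats."""
    lng, lat = (float(part) for part in coordinates.split(","))
    return lat, lng

def airports_near(port_lat, port_lng, airports, max_km=50.0):
    """Return (name, distance_km) pairs within max_km, nearest first.

    Records with the placeholder coordinates '0, 0' are skipped.
    max_km is an assumed cut-off, not a value from the notebook.
    """
    nearby = []
    for name, coordinates in airports:
        lat, lng = parse_lng_lat(coordinates)
        if lat == 0.0 and lng == 0.0:
            continue
        distance = haversine_km(port_lat, port_lng, lat, lng)
        if distance <= max_km:
            nearby.append((name, distance))
    return sorted(nearby, key=lambda pair: pair[1])

# Values copied from the tables above: the NYC port city and the three airport rows.
nyc_airports = [
    ("John F Kennedy International Airport", "-73.77890015, 40.63980103"),
    ("La Guardia Airport", "-73.87259674, 40.77719879"),
    ("JFK", "0, 0"),
]
candidates = airports_near(40.6943, -73.9249, nyc_airports)
for name, distance in candidates:
    print(f"{name}: {distance:.1f} km")
```

Distance only confirms that both airports sit close to the port city. It says nothing about traffic, so the choice between them still relies on information from outside the dataset.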

Winner Airport

Therefore, the winning airport for our modest shawarma joint is JFK.

Visualizing in Kepler.gl

The saved file "hi_vol_ports.csv" can be uploaded to kepler.gl, and with a little manual configuration, further geospatial analysis can be conducted. The link below opens a map with the final dataframe data and the configured layers.

The map can be accessed at: https://kepler.gl/demo/map?mapUrl=https://dl.dropboxusercontent.com/s/dpvk9xkud4kqf1h/keplergl_m41qnas.json

4.2 Data Quality Checks

Run Quality Checks

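# Check 1: the number of distinct countries in the filtered data equals
# the number of countries in the Arabic countries list
print(len(set(ac_vis_df['CitCountry'])) == len(set(arabic_countries)))

# Check 2: for each country, the Spark row count matches the pandas row count
for ac in arabic_countries_dict.keys():
    spark_rows = visitors[visitors['i94cit'] == str(arabic_countries_dict[ac])].count()
    pandas_rows = len(ac_vis_df[ac_vis_df['i94cit'] == arabic_countries_dict[ac]])
    print(spark_rows == pandas_rows)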

True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
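The checks above compare row counts between Spark and pandas. As a complement, the port-level output could also be validated on its values. The following is a minimal sketch, not the notebook's own method: `check_port_rows` is a hypothetical helper, and the sample rows are copied from the `hi_vol_ports` output above with the `Name` and `Shape` columns left out. It assumes that per-port counts in a state should never exceed that state's `StateCount`, since rows with missing city coordinates would drop out of the port-level grouping but not the state-level one.

```python
import csv
import io

def check_port_rows(rows):
    """Return a list of human-readable problems found in port-level rows.

    Each row is a dict with at least I94PORT, State, lat, lng, count, StateCount.
    """
    if not rows:
        return ["no rows found"]

    problems = []
    state_totals = {}   # State -> sum of per-port counts
    state_claimed = {}  # State -> distinct StateCount values seen on its rows
    for row in rows:
        port = row["I94PORT"]
        lat, lng = float(row["lat"]), float(row["lng"])
        count = float(row["count"])
        if not -90.0 <= lat <= 90.0 or not -180.0 <= lng <= 180.0:
            problems.append(f"{port}: coordinates out of range ({lat}, {lng})")
        if count <= 0:
            problems.append(f"{port}: non-positive count {count}")
        state = row["State"]
        state_totals[state] = state_totals.get(state, 0.0) + count
        state_claimed.setdefault(state, set()).add(float(row["StateCount"]))

    for state, claimed in state_claimed.items():
        if len(claimed) != 1:
            problems.append(f"{state}: inconsistent StateCount values {sorted(claimed)}")
        elif state_totals[state] > next(iter(claimed)):
            problems.append(f"{state}: port counts exceed StateCount")
    return problems

# Sample rows taken from the hi_vol_ports output shown earlier.
sample_csv = """I94PORT,city,State,lat,lng,count,StateCount
NYC,New York,NY,40.6943,-73.9249,170329,180561
CHM,Champlain,NY,44.9882,-73.4408,4271,180561
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
problems = check_port_rows(rows)
print(problems or "all checks passed")
```

To run the same checks on the saved file, the rows can be read from "hi_vol_ports.csv" with `csv.DictReader`. The `Shape` column holds long polygon strings, so `csv.field_size_limit` may need to be raised first.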

4.3 Data dictionary

Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

The list of columns in the final processed data frame is below:

I94PORT: CBP location code of the port of entry
city: name of the city of the port of entry
State: code of the state of the port of entry
Name: full name of the state of the port of entry
lat: Latitude of the city of the port of entry
lng: Longitude of the city of the port of entry
count: count of Arab travellers per port of entry
StateCount: count of Arab travellers per state
Shape: shape of the state for mapping purposes
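A data dictionary is easier to keep accurate when it can be compared against the dataframe it describes. The snippet below is a small sketch under that assumption: `DATA_DICTIONARY` and `compare_columns` are hypothetical names, the descriptions are copied from the list above, and the column names follow the `hi_vol_ports` output shown earlier (where the city column is lowercase `city`).

```python
DATA_DICTIONARY = {
    "I94PORT": "CBP location code of the port of entry",
    "city": "Name of the city of the port of entry",
    "State": "Code of the state of the port of entry",
    "Name": "Full name of the state of the port of entry",
    "lat": "Latitude of the city of the port of entry",
    "lng": "Longitude of the city of the port of entry",
    "count": "Count of Arab travellers per port of entry",
    "StateCount": "Count of Arab travellers per state",
    "Shape": "Shape of the state for mapping purposes",
}

def compare_columns(columns, dictionary):
    """Return (undocumented, missing) column names, each as a sorted list.

    undocumented: columns present in the data but absent from the dictionary.
    missing: dictionary entries with no matching column in the data.
    """
    actual, documented = set(columns), set(dictionary)
    return sorted(actual - documented), sorted(documented - actual)

# Column order as shown in the hi_vol_ports output.
final_columns = ["I94PORT", "city", "State", "lat", "lng",
                 "count", "Name", "Shape", "StateCount"]
undocumented, missing = compare_columns(final_columns, DATA_DICTIONARY)
print("undocumented:", undocumented)
print("missing:", missing)
```

In the notebook itself, `list(hi_vol_ports.columns)` could be passed in place of the hard-coded `final_columns` list.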

Step 5: Complete Project Write Up

Clearly state the rationale for the choice of tools and technologies for the project.
Propose how often the data should be updated and why.
Write a description of how you would approach the problem differently under the following scenarios:
The data was increased by 100x.
The data populates a dashboard that must be updated on a daily basis by 7am every day.
The database needed to be accessed by 100+ people.
Most of the included data sources helped in finding the right airport. The geospatial analysis helped identify 4 clusters, or major points of interest: the Northeast (New York and surroundings), Southern California (LA), Northern California (SF), and Texas (Houston).
The immigration data should be updated yearly (or at whatever frequency the responsible authority releases it).
If the data were increased by 100x, I would increase the number of nodes in the Spark cluster. After grouping the immigration data in Spark, the analysis could continue in Pandas, the same way it has been done here.
If the data needed to populate a dashboard on a daily basis, Apache Airflow would help with scheduling the pipeline, together with an online database from which the dashboard can draw the data.
If the database needed to be accessed by 100+ people, that is a relatively low number and could be handled by any database. If the number grew much larger, hosting on Redshift or a hosted version of Cassandra might help with the load.

samerelhousseini / US-Immigration-Data-Analysis-Shawarma-Joint
