rayleegit / dataEngineer

Data Engineering Sample

Data Engineering Project

What problem will we solve?

We will look at the most populous U.S. cities and identify which ones are the most expensive and the most affordable to live in. This will help us decide which city we'd like to move to next.

What datasets will we use?

We will scrape three datasets:

  1. Wikipedia's "List of United States cities by population" data, which lists the most populous U.S. cities;

  2. Zillow Home Value Index data, an estimate of the median home value in each city;

  3. Wikipedia's "Household income in the United States" data, which lists 2017 median household income by state.
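As a minimal sketch of the scraping step: `pandas.read_html` parses every HTML table on a page into a DataFrame. The notebook would call it on the live Wikipedia and Zillow pages; the inline table below is a stand-in so the sketch runs offline (the column names and values here are illustrative, not the real scraped schema).

```python
from io import StringIO

import pandas as pd

# pd.read_html returns one DataFrame per <table> found in the input.
# In the notebook this would be a URL; here an inline table keeps the
# example self-contained and offline.
html = StringIO("""
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>New York</td><td>8804190</td></tr>
  <tr><td>Los Angeles</td><td>3898747</td></tr>
</table>
""")

tables = pd.read_html(html)  # list of DataFrames, one per table
cities = tables[0]
```

The same call, pointed at the Wikipedia population page, yields the city list; a second call against the Zillow and household-income pages yields the other two datasets.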

How will we use these datasets to solve the problem?

We will append the (2) median home value data and (3) household income data to the (1) list of most populous U.S. cities, then calculate a "Cost Score": the number of years of median household income required to purchase a median-value home in each city. This tells us, relative to other cities, how costly it is to live in a particular city. We will then create a "Cost Rank" based on this score.
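In pandas, the join and scoring can be sketched as below. The toy DataFrames stand in for the three scraped datasets, and the column names are illustrative assumptions, not the notebook's actual schema.

```python
import pandas as pd

# Toy stand-ins for the three scraped datasets (values are illustrative)
cities = pd.DataFrame({"city": ["A", "B"], "state": ["X", "Y"]})
homes = pd.DataFrame({"city": ["A", "B"], "median_home_value": [500000, 200000]})
income = pd.DataFrame({"state": ["X", "Y"], "median_household_income": [60000, 50000]})

# Append home values (joined by city) and income (joined by state)
df = cities.merge(homes, on="city").merge(income, on="state")

# Cost Score: years of median income needed to buy a median-value home
df["cost_score"] = df["median_home_value"] / df["median_household_income"]

# Cost Rank: 1 = most years of income required, i.e. least affordable
df["cost_rank"] = df["cost_score"].rank(ascending=False).astype(int)
```

Here city A needs about 8.3 years of income versus 4.0 for city B, so A ranks 1 (least affordable) and B ranks 2.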

What steps will we take to do this?

We will:

I. Scrape the Data

II. Join the Data

III. Analyze the Data

For step-by-step code, refer to 'DataEngineering.ipynb'. The output of this notebook is 'topCitiesJoined.csv', which lists the most populous U.S. cities along with their 'Cost Score' and 'Cost Rank'. This CSV is ready to be uploaded to a BigQuery table.
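One way to upload the CSV is with the `bq` command-line tool; this is a sketch, not part of the repo, and the dataset and table names here are placeholders.

```shell
# Load the notebook's output CSV into a BigQuery table.
# "my_dataset.top_cities" is a placeholder; substitute your own
# project's dataset and table names.
bq load --autodetect --source_format=CSV \
  my_dataset.top_cities topCitiesJoined.csv
```

`--autodetect` lets BigQuery infer the column types from the CSV, which is convenient for a small, clean table like this one.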

NOTE: To run this notebook, you will need the Anaconda Distribution with Python 3.7 installed.
